MediaWiki tricks 3 – Hide your workbench
robots.txt and related tricks to make your wiki look more serious in search engine results, among other nice effects.
You probably want your wiki project to show up among the first places in search engines for the topics it deals with, but you very likely don’t want all your „internal“ debates showing up along with, or even ahead of, your polished wiki articles (which very likely resulted from those debates and contain the same keywords). You may also want to keep random external people from jumping on and trolling „hot“ internal wiki debates without restricting editing in your wiki.
First, a list of pages you may want to hide from search engines:
Your wiki may contain a global project talk page (aka „Village Pump“) that is linked from the sidebar on the left. Global debates are useful for people interested in the project, but not for random visitors who have not seen a single other page of your wiki before. This page is also usually heavily linked inside the wiki and thus very likely near the top of search results.
Your project contact page. People who want to contact you should at least know what your project is about, so they should not bother contacting you before browsing your wiki. Furthermore, email harvesters seem to use search engines to collect spam victims’ email addresses. So if your contact email address is not accessible via a search engine, you get much less (or even no) spam without needing to obfuscate your contact address (this strategy works perfectly at Pakanto).
Further specific global project management pages such as article deletion queues, which can contain some very rude debates, e.g. between the original author and others…
Automatically generated „special“ pages, such as the internal search engine. Although these pages already carry a meta robots tag telling search engines not to index them and not to follow their links („noindex, nofollow“), a robots.txt file keeps web spiders from loading them at all (and thus saves some server resources).
The same applies to edit forms. Although MediaWiki itself prevents their indexing, generating these pages is quite time consuming. However, if you want to include „edit“ and other parameter URLs in robots.txt as well, you need a clear distinction between the URLs for viewing and for editing articles, otherwise you would disallow search engine spiders for your entire wiki. So you need to configure your wiki with so-called pretty URLs (which is also favourable for external references to your wiki; search engines also rank folder-like structures higher than parameter URLs).
Every article, and every page in another namespace, has an associated talk page. These talk pages usually get indexed by search engines and are very often listed near the top of search results, which is odd: you want to show your product, not your work process, to external people (and besides, random garbage in hot internal article debates is no fun). So preventing the indexing of talk pages in all namespaces is also a good idea. The German language Wikipedia, for example, did this with quite some success.
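The pretty URL setup mentioned above can be sketched roughly like this in LocalSettings.php (the /w and /wiki paths are assumptions; adapt them to your installation):

```php
# LocalSettings.php – a minimal sketch, assuming MediaWiki is installed
# under /w and article views should appear under /wiki/Article_name
$wgScriptPath  = "/w";        // real location of index.php and friends
$wgArticlePath = "/wiki/$1";  // pretty URL pattern used for page views
```

In addition your web server needs a rewrite rule mapping /wiki/Article_name to /w/index.php?title=Article_name. With this split, article views and parameter URLs (edit forms, history pages etc.) live under clearly distinct paths, which is exactly what the robots.txt method needs.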
One way of preventing polite web spiders from indexing pages is a „robots.txt“ file. So what does such a file look like? Take a look at http://pakanto.org/robots.txt and http://en.wikipedia.org/robots.txt, and find some further hints at MediaWiki: Prevent bots from crawling index.php and Meta: robots.txt.
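As a rough sketch, such a file could look like this (the /w and /wiki paths match the pretty URL scheme mentioned above and are assumptions; adapt them to your setup, and note that robots.txt matches URL path prefixes):

```
# robots.txt – a minimal sketch for a wiki with pretty URLs
User-agent: *
# block all parameter URLs (edit forms, history pages, diffs ...)
Disallow: /w/
# block automatically generated special pages, e.g. the internal search
Disallow: /wiki/Special:
# block single pages you want hidden, e.g. a hypothetical Village Pump
Disallow: /wiki/Project:Village_pump
```

Since robots.txt only knows prefix matching, this approach works cleanly only when views and parameter URLs are separated via pretty URLs as described above.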
If you don’t like the robots.txt approach (for example because you would have to keep it in sync with domain/URL changes), you can also apply the above mentioned meta tags to whole namespaces and to selected single pages by configuring $wgNamespaceRobotPolicies and $wgArticleRobotPolicies in your wiki’s LocalSettings.php.
Wikipedia uses a combined approach: single pages get listed in robots.txt, and whole namespaces get configured via $wgNamespaceRobotPolicies (at the German language Wikipedia, for example, for all talk pages). However, be aware that if you apply the meta tag method to an existing wiki, at least Google only wipes newer page versions out of its search index, not old versions crawled before the change. If you prevent indexing via robots.txt instead, old versions get wiped out as well.
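The meta tag approach can be sketched like this in LocalSettings.php (the namespace constants are MediaWiki’s built-in ones; the page name in $wgArticleRobotPolicies is just a hypothetical example):

```php
# LocalSettings.php – a sketch of the meta tag method
# apply „noindex, nofollow“ to whole talk namespaces
$wgNamespaceRobotPolicies = array(
    NS_TALK         => 'noindex,nofollow',  // article talk pages
    NS_USER_TALK    => 'noindex,nofollow',  // user talk pages
    NS_PROJECT_TALK => 'noindex,nofollow',  // project talk pages
);

# and to selected single pages (page name is an example only)
$wgArticleRobotPolicies = array(
    'Project:Village pump' => 'noindex,nofollow',
);
```

Pages covered by these settings then get the meta robots tag in their HTML head, so no robots.txt entry is needed for them and the configuration survives domain/URL changes.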