Archive for the ‘wiki-en’ Category

The bot equivalent of the atom bomb has been ignited.

28 December 2007

Ever since Rambot on en.wikipedia back in the old days, bot-generated articles have been a source of constant controversy. The main points have remained unchanged since then:

  • Supporters say that a bot-generated article is better than none at all and that such articles attract new editors.
  • Opponents say that bot-generated articles don’t contain encyclopedic content but merely database dumps, that it is frustrating to read a Wikipedia full of substubs instead of a smaller one with decent articles, and that maintenance of bot articles is a nightmare, especially if article numbers grow faster than the community can deal with.
  • Needless to say, I’m an opponent of bot-generated articles because:

    Wikipedia is an encyclopedia which predominantly consists of summary texts describing the unique features of its subjects in individual articles.

A bot can never achieve that. However, up to now I have just grumbled silently, since the majority of content (if not always the majority of pages) in any Wikipedia was still written by humans.

But recently the Volapük Wikipedia decided to ignite the ultimate bot bomb: a Wikipedia almost entirely written by bots. Yes, nearly 100% of all bytes in there come from a bot. They defend themselves with comparisons ranging from Rambot to tsca.bot page-number cheating, claiming a right to do stupid things because others did them before, too… They even argue that individual bot articles may be longer in bytes than some human-written articles, suggesting that these bot articles necessarily contain more information and are therefore even „the better articles“ (TM). I could go on…

Some people therefore wanted the Volapük Wikipedia to be closed entirely, but failed. I started a more sensible attempt in order to save the good bits there: a request for deleting all minor bot-generated articles in the Volapük Wikipedia and moving it back to the Wikimedia Incubator. But see for yourself how these people manage to drag the discussion down to the level of suppressed minorities and other arguments that always „work“.

P.S.: Don’t forget: please take the time to read up and form your own opinion before posting a comment about this anywhere.


MediaWiki tricks 3 - Hide your workbench

3 November 2007

Robots.txt and related tricks to make your wiki look more serious in search engine results, among other nice effects.

You probably want your own wiki project to rank among the first places in search engines for the topics covered in your wiki, but you very likely don’t want all your „internal“ debates showing up alongside or even ahead of your polished wiki articles (which very likely resulted from those debates and contain the same keywords). And/or you may want to keep random external people from jumping in and trolling on „hot“ internal wiki debates without restricting editing in your wiki.

First, a list of pages you may want to hide from search engines:

  • Your wiki may contain a global project talk page (aka „Village Pump“) that is linked from the sidebar on the left. Global debates are useful for people interested in the project but not for random visitors who have not seen a single other page of your wiki before. This page is also usually heavily linked inside the wiki and thus very likely near the top of search results.
  • Your project contact page. People who want to contact you should at least know what your project is about, so they should not bother contacting you before browsing your wiki. Furthermore, email harvesters seem to use search engines to collect victims’ email addresses for spam. If your contact email address is not reachable via a search engine, you get much less (or even no) spam without needing to obfuscate your contact address (this strategy works perfectly on Pakanto).
  • Further global project management pages such as article deletion queues, which might contain some very rude debates, e.g. between the original author and others…
  • Automatically generated „special“ pages, such as the internal search engine. Although they already carry a robots meta tag telling search engines not to index the page and not to follow its links („noindex, nofollow“), you can prevent web spiders from loading these pages at all via a robots.txt file (and thus save some server resources).
  • The same applies to edit forms. Although MediaWiki itself prevents their indexing, generating these pages is quite time-consuming. However, if you want to include „edit“ and other parameter URLs in robots.txt as well, you need a clear distinction between the URLs for article views and for editing, in order to avoid disallowing search engine spiders for your entire wiki. So you need to configure your wiki with pretty URLs (which is also favourable for external references to your wiki; search engines also rank folder-like structures higher than parameter URLs).
  • Every article or page in another namespace has an associated talk page. These talk pages usually get indexed by search engines and are very often listed near the top of search results, which is odd, as you want to show external people your product and not your work process (and besides, random garbage in hot internal article debates is no fun). So preventing indexing of talk pages in all namespaces is also a good idea. The German-language Wikipedia, for example, did so with quite some success.
  • One way of preventing polite web spiders from indexing pages is a „robots.txt“ file. So what does such a file look like? Take a look at http://pakanto.org/robots.txt and http://en.wikipedia.org/robots.txt, and at the further hints at MediaWiki: Prevent bots from crawling index.php and Meta: robots.txt. A minimal sketch follows right after this list.
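
For illustration, here is a minimal robots.txt sketch. It assumes a MediaWiki setup with pretty URLs where articles live under /wiki/ and the script path (index.php with its edit, history and other parameter URLs) lives under /w/; the paths and page names are made-up examples, not the actual entries used by pakanto.org or Wikipedia.

    # Rules for all (polite) web spiders
    User-agent: *
    # Block the script path: edit forms, page histories, parameter URLs
    Disallow: /w/
    # Hide selected project pages from search engines
    Disallow: /wiki/Project:Village_pump
    Disallow: /wiki/Project:Contact

With such a URL layout, ordinary article views under /wiki/ stay crawlable, while everything generated through index.php is skipped before the server even has to render it.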

If you don’t like the robots.txt approach (for example because you have to take care of it whenever your domain or URLs change), you can instead apply the above-mentioned meta tags to whole namespaces and to selected single pages by configuring $wgNamespaceRobotPolicies and $wgArticleRobotPolicies in your wiki’s LocalSettings.php.
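
A minimal sketch of such a configuration in LocalSettings.php could look like the following; the namespaces and page titles are made-up examples that you would adapt to your own wiki:

    # Keep all talk namespaces out of search engine indexes
    # (MediaWiki then emits a "noindex,nofollow" robots meta tag on these pages)
    $wgNamespaceRobotPolicies = array(
        NS_TALK         => 'noindex,nofollow',
        NS_USER_TALK    => 'noindex,nofollow',
        NS_PROJECT_TALK => 'noindex,nofollow',
    );
    # Hide selected single pages, e.g. a village pump or contact page
    $wgArticleRobotPolicies = array(
        'Project:Village pump' => 'noindex,nofollow',
        'Project:Contact'      => 'noindex,nofollow',
    );

Since the meta tag is added by MediaWiki itself, no web server or robots.txt changes are needed for this variant.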

Wikipedia uses a combined approach: single pages are listed in robots.txt, and whole namespaces are configured via $wgNamespaceRobotPolicies (at the German-language Wikipedia, for example, for all talk pages). However, be aware that if you apply the meta tag method to an existing wiki, at least Google will only drop newer page versions from its search index; old versions that were indexed before you applied the meta tags stay in. In contrast, it will also wipe out old versions if you prevent indexing via robots.txt.

Investigations on Russian copyright

23 July 2007

Quite a few of you are probably aware of the past flame wars over copyright duration and the wild assumptions about Soviet and Russian copyright on en.wikipedia, de.wikipedia, Wikimedia Commons and other Wikimedia projects. Lupo was the hero who dove into this complicated matter and enlightened this stuck debate with plain facts.

Now he has produced a real masterpiece out of this debate: a series of four Wikipedia articles that cover all aspects of the topic. I was in awe when I read through them and am pretty sure this is the best available compilation on the whole matter of Russian copyright, but read for yourself:

  • Copyright in Russia
  • Copyright law of the Soviet Union
  • Copyright law of the Russian Federation
  • International copyright relations of Russia
I can only recommend reading them, translating them into other languages and using them as the factual basis for your daily copyright work on this matter. I’d love it if other people could produce such a comprehensive analysis, resulting in Wikipedia articles for other nations where we have comparable debates, too.

A funny side aspect: we extend Wikipedia in order to teach ourselves and keep our project alive. 🙂