Delayed Indexing

Moved here from WikiWikiSystemNotice:

Q: Does the <META NAME="ROBOTS" CONTENT="NOINDEX"> in line 2 of the HTML source of many WikiPages mean that these pages are never indexed?

A: No, the line will disappear 10 hours after the last edit. This ensures that only stable (and spam free) pages are indexed by SearchEngines.


All wiki sites are suffering under a deluge of WikiSpam. Unwelcome links are added to pages, for the sole purpose of increasing SearchEngine PageRank. This site, the WikiWikiWeb, now requests that search robots not index pages that have been recently edited. We call this DelayedIndexing.

We encourage all wiki operators to adopt this practice, which is implemented as a single-line modification to the wiki script:

  my $meta = -M "pages/$title" > 10/24 ? '' : '<META NAME="ROBOTS" CONTENT="NOINDEX">';
This line creates a variable $meta that contains instructions to search robots unless the most recent modification occurred more than 10 hours ago. I include the $meta variable in the HTML header of each page. I recommend that all wiki operators adopt this practice and loudly announce that they have done so.
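
A slightly fuller sketch of the same check, wrapped in a subroutine so the threshold can be varied. The name robots_meta and its parameters are illustrative assumptions, not part of the actual wiki script. It reads the modification time with stat instead of -M, so the age is measured from the current time rather than from script start.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the NOINDEX tag while the page file is
# younger than $hours, and an empty string once it has been stable.
sub robots_meta {
    my ($file, $hours) = @_;
    my $noindex = '<META NAME="ROBOTS" CONTENT="NOINDEX">';
    my $mtime = (stat $file)[9];
    # A missing file has no known age; treat it as recently edited.
    return $noindex unless defined $mtime;
    my $age_hours = (time() - $mtime) / 3600;
    return $age_hours > $hours ? '' : $noindex;
}
```

It would be called as robots_meta("pages/$title", 10) and the result placed in the page header, as described above.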

[Note: "NOINDEX, NOFOLLOW" is used for the editing and diffs pages.]

I also recommend that operators select a threshold that is ten times longer than the longest you would expect spam to persist. We see spam removed in minutes, but have chosen to set our threshold to ten times the upper bound of one hour. If you are the only one removing spam and you check your site once a day, set your threshold to ten days.
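
The rule of thumb above can be written out as arithmetic. This is only a sketch; threshold_days is a made-up name, and the result is in days because that is the unit -M reports.

```perl
use strict;
use warnings;

# Hypothetical helper: ten times the longest time spam is expected to
# survive, expressed in days so it can be compared with -M.
sub threshold_days {
    my ($max_spam_hours) = @_;
    return 10 * $max_spam_hours / 24;
}

printf "%.3f\n", threshold_days(1);   # spam gone within the hour: 10/24 of a day
printf "%.3f\n", threshold_days(24);  # site checked once a day: ten days
```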

Also: History files are blocked from indexing via robots.txt.


Deleted pages could be handled the same way, i.e. by deferring the actual delete until the above time span is over. This could reduce the effect of EditWars.


The spammers might not know or care about delayed indexing for a single site. They are probably targeting hundreds of wikis and don't care if a few make their effort futile.


I am not sure this is a good method since Google should remove pages from its index when it finds the noindex meta tag. -- Joe(at)chongqed.org

Perhaps we could show Google an older version of the page until the "no index" timeout expires. -- JeffGrigg

I wondered about that, but then you would have to detect when Google or other search engines visit. Luckily, it looks like there was nothing to worry about. I have been testing it on our wiki, and so far I have seen no bad effects. PageRank comes back after a page has been temporarily removed from the index. If a page happened to be out of the index while Google recalculated PageRank, the result could be different, but that doesn't happen very frequently. I am relying on the PageRank info we can get from the Google Toolbar, which may not be totally accurate or up to date with what they actually use for ranking, but it's all we can do. I will give it one more try just to make sure. See: http://wiki.chongqed.org//DelayedIndexing_Experiment

This could still be a problem if a page is very frequently edited, since it could be removed from Google's index for a long time, but that probably isn't a problem for most wikis if the noindex expires soon enough. I still don't especially like this idea, but it is good protection for a wiki. It is probably better to have your page not indexed than to have it indexed with spam, because that will attract other spammers. If the wiki is large and active with a good PageRank, Google will probably crawl it often enough that a page removed from the index would soon be recrawled, hopefully without finding another noindex. If the wiki is not active, this method will have little benefit anyway, since no one is likely to find and clean the spam before the noindex expires. Maybe the noindex mode could be turned on only if external links are added to the page. That would allow normal discussion to go on without the page always going into noindex mode.
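
A rough sketch of that last idea: compare the old and new text of a page and report whether the edit introduced a URL that was not there before. The subroutine names and the simple URL pattern are assumptions for illustration, not code from any wiki engine.

```perl
use strict;
use warnings;

# Hypothetical helper: collect the distinct http(s) URLs in a page text.
sub external_links {
    my ($text) = @_;
    my %seen;
    $seen{$_} = 1 for $text =~ m{(https?://[^\s<>"\]]+)}g;
    return \%seen;
}

# Returns the number of URLs in the new revision that the old revision
# did not contain (zero, hence false, when no link was added).
sub adds_external_link {
    my ($old_text, $new_text) = @_;
    my $old = external_links($old_text);
    my $new = external_links($new_text);
    return scalar(grep { !$old->{$_} } keys %$new);
}
```

The save routine would have to record the result somewhere, for example as a separate timestamp per page, because a plain modification-time test cannot tell a link-adding edit from any other edit.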

The ideal version of delayed indexing would be to show an old revision to the search engine robots, but that could potentially cause trouble, since that could be considered trying to trick the search engine, which can get a site banned. The chances of being caught are low, and when it is done for a good reason, as it is here, it would probably be no problem. But you would also have to be sure to go back to a clean version. If some spam does slip past the expiry, the next cleanup would be the revision with noindex, while the wiki keeps showing the spammed revision to Google.

There are potential problems with the whole delayed indexing idea, but it also provides relatively good protection from attracting new spammers, so it's hard to say whether it's good or bad. That depends on the activity of the wiki and how bad the spam problem is for that wiki. Another major factor in the decision is how important being indexed in Google is to the site. If the wiki is just a supplement to a larger website maybe that doesn't matter, but if it's Wikipedia you want to stay indexed in Google. It certainly is not good for all wikis: too small and it probably doesn't help, too big and it may hurt too much, but somewhere in the middle it could be a perfect solution.

-- Joe(at)chongqed.org


The ideal version of delayed indexing would be to show an old revision to the search engine robots. But that could potentially cause trouble - since that could be considered trying to trick the search engine

Yes, showing the search engine one thing, but showing everyone else another thing, is usually frowned on.

What if we showed the search engine and *most* people one thing? But we showed logged-in administrators (those who set their UserName) something a little different? Would people still frown at that?


I am not sure about this either. Chances are that SEs will crawl some of your pages while they carry this noindex, and then the SEs will not actually read the text. So when people search for keywords that your page would have come up for, it won't appear. By implementing this you are more likely to hurt yourself than any spammers.

zydia@theatons.com http://www.theatons.com/

See http://wikifeatures.wiki.taoriver.net/moin.fcg/DelayedCommits


Perhaps you could make the wiki system add rel=nofollow to all of the external links? That's what a lot of wikis do.

-bdodson
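
For what it's worth, a minimal sketch of that suggestion as a pass over the rendered HTML. The name add_nofollow is made up, and the sketch assumes that links are emitted as <a> tags with double-quoted href attributes and that every absolute http(s) link counts as external.

```perl
use strict;
use warnings;

# Hypothetical helper: add rel="nofollow" to <a> tags whose href is an
# absolute http(s) URL and which carry no rel attribute yet.
# Relative links (internal wiki links) are left untouched.
sub add_nofollow {
    my ($html) = @_;
    $html =~ s{<a\s+(?![^>]*\brel\s*=)([^>]*\bhref\s*=\s*"https?://[^"]*"[^>]*)>}
              {<a $1 rel="nofollow">}gi;
    return $html;
}
```

A regex pass like this is fragile against unusual markup; a wiki that builds its links in one place would more naturally add the attribute there.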


CategoryWiki

Last edited June 3, 2010