[pmwiki-users] Google local site search

Tue Dec 27 18:05:33 CST 2005

On Wed, Dec 28, 2005 at 12:51:21AM +0100, Joachim Durchholz wrote:
> >
> >Newer versions of PmWiki (since 2.1.beta8) automatically return 
> >"403 Forbidden" errors to robots for any action other than 
> >?action=browse, ?action=rss, or ?action=dc.
> 
> Um... AFAIK Google punishes sites that are "polymorphic" when crawled by 
> Google. (Dunno how they find out - maybe they send a crawler that looks 
> just like a normal browser and samples some of the pages. Might be just 
> a rumor, but then I'm generally shy of doing pages differently depending 
> on who visits it - what if there's a bug in the code that does the 
> polymorphism? I'll never find out.)

I've done a bit of research on this, and according to several experts
only sites that present egregiously different content are punished for
polymorphism.  Minor changes to link targets, such as stripping off
query parameters, supposedly aren't punished.

> It might be a better idea to mark the ?action=edit etc. links as "don't 
> follow by spiders". I.e.
>   <a href="...?action=edit" rel="nofollow">...</a>

1.  PmWiki determines what to strip based on $ScriptUrl -- in many cases
    it doesn't have the full <a href='...'>...</a> tag immediately
    available in order to add rel="nofollow" to it.  And some <a> tags
    already have a rel= attribute.

2.  I'm not convinced that adding rel="nofollow" means that the
    robot won't follow the link.  According to 
    http://googleblog.blogspot.com/2005/01/preventing-comment-spam.html
    and http://microformats.org/wiki/relnofollow, the rel="nofollow"
    attribute simply means that the search engine shouldn't give the
    link any credit when ranking sites in search results.  It doesn't
    mean that the robot doesn't follow the link.

    Google does say at http://www.google.com/webmasters/bot.html
    that placing rel="nofollow" will cause Googlebot to not follow
    the link, but that doesn't mean that other robots have to follow suit.
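
To illustrate point 1, here is a rough Python sketch (not PmWiki's
actual code, which is PHP) of stripping ?action= from links by matching
on the script URL alone, without needing the surrounding <a> tag.  The
SCRIPT_URL value and the cloak_actions name are made up for the
example, and it assumes ?action= is the only query parameter on the
link; anything else is left untouched.

```python
import re

# Hypothetical stand-in for the wiki's $ScriptUrl setting.
SCRIPT_URL = "http://example.com/pmwiki.php"


def cloak_actions(html, script_url=SCRIPT_URL):
    """Drop a trailing ?action=... from URLs that begin with script_url.

    Only the URL text is examined, so existing attributes such as
    rel="..." on the enclosing tag are never touched.
    """
    pattern = re.compile(
        # Group 1: the script URL plus any path that follows it.
        r"(" + re.escape(script_url) + r"[^\s\"'<>?#]*)"
        # The action parameter, only when it is the sole query parameter
        # (the lookahead requires the URL to end right after it).
        r"\?action=\w+(?=[\"'\s<>#]|$)"
    )
    return pattern.sub(r"\1", html)


page = ('<a rel="external" '
        'href="http://example.com/pmwiki.php/Main/HomePage?action=edit">'
        'Edit</a>')
print(cloak_actions(page))
```

Links on other hosts, and links that carry additional query
parameters, pass through unchanged in this sketch.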

> >In addition, if $EnableRobotCloakActions is set, then any ?action=
> >parameters are removed from page links when viewed by a robot,
> >so that those robots won't blindly follow links to unimportant
> >pages.
> 
> How does PmWiki find out it's being accessed by a robot?

It just does a simple pattern match against the User-Agent HTTP header.
The point of robots.php isn't to absolutely detect and control every
possible robot; it's just to detect and manage the most popular ones
and reduce the load on the site.
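
As a rough illustration of the idea (a Python sketch, not the actual
robots.php code), the check amounts to something like the following.
The robot pattern and the function names here are assumptions made up
for the example; the allowed actions are the ones mentioned above
(browse, rss, dc).

```python
import re

# Hypothetical pattern covering a few well-known crawlers; the real
# list used by the wiki may be quite different.
ROBOT_PATTERN = re.compile(r"googlebot|slurp|msnbot|crawler|spider",
                           re.IGNORECASE)

# Actions that robots are still allowed to request.
ALLOWED_ROBOT_ACTIONS = {"browse", "rss", "dc"}


def is_robot(user_agent):
    """Return True if the User-Agent header looks like a known robot."""
    return bool(ROBOT_PATTERN.search(user_agent or ""))


def status_for(user_agent, action="browse"):
    """Return the HTTP status to send: 403 for robots requesting any
    action outside the allowed set, 200 otherwise."""
    if is_robot(user_agent) and action not in ALLOWED_ROBOT_ACTIONS:
        return 403
    return 200


bot = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
print(status_for(bot, "edit"))
print(status_for(bot, "rss"))
```

A robot that sends a browser-like User-Agent slips through, which is
fine given the goal of handling only the most popular robots.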

Pm