[pmwiki-users] Google local site search
Patrick R. Michaud
pmichaud at pobox.com
Tue Dec 27 18:05:33 CST 2005
On Wed, Dec 28, 2005 at 12:51:21AM +0100, Joachim Durchholz wrote:
> >
> >Newer versions of PmWiki (since 2.1.beta8) automatically return
> >"403 Forbidden" errors to robots for any action other than
> >?action=browse, ?action=rss, or ?action=dc.
>
> Um... AFAIK Google punishes sites that are "polymorphic" when crawled by
> Google. (Dunno how they find out - maybe they send a crawler that looks
> just like a normal browser and samples some of the pages. Might be just
> a rumor, but then I'm generally shy of serving pages differently
> depending on who visits them - what if there's a bug in the code that
> does the polymorphism? I'll never find out.)
I've done a bit of research on this, and according to several experts
only sites that present egregiously different content are punished for
polymorphism. Minor changes to link targets, such as stripping off
query parameters, reportedly aren't punished.
> It might be a better idea to mark the ?action=edit etc. links as "don't
> follow by spiders". I.e.
> <a href="...?action=edit" rel="nofollow">...</a>
1. PmWiki determines what to strip based on $ScriptUrl -- in many cases
it doesn't have the full <a href='...'>...</a> tag immediately
available in order to add rel="nofollow" to it. And some <a> tags
already have a rel= attribute.
2. I'm not convinced that adding rel="nofollow" means that the
robot won't follow the link. According to
http://googleblog.blogspot.com/2005/01/preventing-comment-spam.html
and http://microformats.org/wiki/relnofollow, the rel="nofollow"
attribute simply means that the search engine shouldn't give the
link any credit when ranking sites in search results. It doesn't
mean that the robot doesn't follow the link.
Google does say at http://www.google.com/webmasters/bot.html
that placing rel="nofollow" will cause Googlebot to not follow
the link, but that doesn't mean that other robots have to follow suit.
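For what it's worth, merging the attribute into markup that may already
carry a rel= value is mechanical enough. Here's a rough sketch in
Python (the function name and regexes are mine, not PmWiki's actual
code) of what adding rel="nofollow" to ?action= links would involve:

```python
import re

def add_nofollow(html):
    """Hypothetical helper: add rel="nofollow" to <a> tags whose href
    contains ?action=, merging with any existing rel= attribute."""
    def fix(match):
        tag = match.group(0)
        if "?action=" not in tag:
            return tag  # plain browse link, leave untouched
        if "rel=" in tag:
            # merge "nofollow" into the existing rel attribute
            return re.sub(r'rel=([\'"])(.*?)\1',
                          lambda m: 'rel=%s%s nofollow%s'
                                    % (m.group(1), m.group(2), m.group(1)),
                          tag)
        # no rel attribute yet: append one before the closing '>'
        return tag[:-1] + ' rel="nofollow">'
    return re.sub(r'<a\b[^>]*>', fix, html)
```

So `add_nofollow('<a href="Page?action=edit">edit</a>')` would yield
`<a href="Page?action=edit" rel="nofollow">edit</a>`, and an existing
rel= attribute would become e.g. rel="external nofollow".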
> >In addition, if $EnableRobotCloakActions is set, then any ?action=
> >parameters are removed from page links when viewed by a robot,
> >so that those robots won't blindly follow links to unimportant
> >pages.
>
> How does PmWiki find out it's being accessed by a robot?
It just does a simple pattern match against the User-Agent HTTP header.
The point of robots.php isn't to detect and control every possible
robot absolutely; it's just to detect and manage the most popular ones
and reduce the load on the site.
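As a rough illustration only (the pattern and function names below are
my own invention, not the actual robots.php code), the two mechanisms
amount to something like:

```python
import re

# Assumed list of common crawler tokens; robots.php's real pattern differs.
ROBOT_PATTERN = re.compile(r'Googlebot|Slurp|msnbot|crawler|spider', re.I)

def is_robot(user_agent):
    """Crude check: does the User-Agent header look like a known crawler?"""
    return bool(ROBOT_PATTERN.search(user_agent or ''))

def cloak_actions(html):
    """Strip ?action=... from page links, roughly what
    $EnableRobotCloakActions is described to do, so robots only see
    plain browse URLs."""
    return re.sub(r'\?action=[^\'"&#]*', '', html)
```

A request whose User-Agent matched would then get the cloaked output,
e.g. `<a href="Main/HomePage?action=edit">` becomes
`<a href="Main/HomePage">`.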
Pm