[pmwiki-users] Robustness of PmWiki

Patrick R. Michaud pmichaud at pobox.com
Fri Jun 23 22:54:10 CDT 2006


On Sat, Jun 24, 2006 at 02:08:03AM +0200, Martin Bayer wrote:
> I'm currently running my wiki under MoinMoin. It's a managed wiki hosting,
> and now the admin decided to exclude the 'Slurp' u/a (Yahoo/ Inktomi search
> robot) completely by setting a corresponding rule in the robots.txt file,
> because, as he puts it, that bot is 'ddossing' the system. [...]

Short answer:  Yahoo! Slurp is really lame.

This problem came up for pmwiki.org back in March, and I posted a number of
messages describing what was happening with PmWiki.  (Thread starts at
http://host.pmichaud.com/pipermail/pmwiki-users/2006-March/024894.html .)

Essentially, while watching the access logs I noticed that Slurp always
seemed to be crawling my site, so I did a quick analysis of the access
logs and determined that Slurp was responsible for about 18% of all of
the hits to my server (207,510 out of 1.2 million total hits over a period
of 15 days).

In short, Slurp was requesting nearly 14,000 pages per day.
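
For the curious, that kind of tally doesn't require anything fancy.  The
following is just an illustrative sketch (not the actual script I used),
assuming an Apache "combined" access log where the user-agent is the last
quoted field on each line:

    #!/usr/bin/env python
    # Sketch: count access-log hits per crawler user-agent to estimate
    # each bot's share of total traffic.
    import sys
    from collections import Counter

    bots = ('Slurp', 'Googlebot', 'msnbot')
    counts = Counter()
    total = 0

    for line in sys.stdin:
        total += 1
        parts = line.rstrip().split('"')
        if len(parts) < 2:
            continue                 # malformed line, skip it
        ua = parts[-2]               # last quoted field = user-agent
        for bot in bots:
            if bot in ua:
                counts[bot] += 1
                break

    for bot in bots:
        share = 100.0 * counts[bot] / total if total else 0.0
        print("%-10s %8d hits  (%.1f%% of %d)" % (bot, counts[bot], share, total))

(Feed it the log on stdin, e.g. "python tally.py < access_log", with
whatever filenames your setup uses.)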

What's worse is that Slurp isn't at all smart about what it requests --
in that 15 day period it asked for the /robots.txt file (which hadn't
changed) at least 5,500 times.   There were over 1,600 separate
urls that it requested at least fifteen times (i.e., once per day or more).
Some urls were especially egregious, such as the 380 requests for the
home url, or the 230 requests for http://www.pmwiki.org/wiki/PITS/PITS .

Thus, I can completely understand your admin's position on Slurp.
I can't think of any reason why a search engine's spiders need to be
requesting the same url several times per day.  In this respect its
impact does resemble a DDoS attack.

I'll detail my approach to Slurp below...

> However, even if it were 100 or 150 requests contemporally, I think that's
> just the way the internet works. And if a wiki engine cannot handle this,
> the wiki engine is broken, not the internet. 

I can't entirely agree.  By their nature wikis are highly dynamic content
generators, and expecting a wiki to sustain that level of dynamic content
generation and still be blindingly fast on average hardware or in a heavily
shared hosting environment is a bit unrealistic.

(For those who say "PmWiki should cache the HTML output of each page
to avoid regenerating the content on each request": this ends up being
*very* difficult to do in practice while still preserving some of PmWiki's
most requested features.  The appearance of any given page can depend
heavily on the visitor's identity, authorizations, the contents of
a page, or even the date or time of day.  Add custom recipes to the
mix and... well, it's hard to keep a useful cache. :-)
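
To make that concrete, here's a tiny illustrative sketch (in Python, not
PmWiki's actual PHP code, with invented parameter names): every
request-specific factor that affects the rendered HTML has to become part
of the cache key, and each one fragments the cache further.

    # Illustrative only -- not PmWiki code.  A cache key for rendered
    # HTML has to capture everything the output depends on; anything
    # time-dependent makes cached copies go stale on their own schedule.
    import hashlib, time

    def cache_key(pagename, visitor_id, auth_level, page_mtime,
                  time_sensitive=False):
        parts = [pagename, visitor_id, auth_level, str(page_mtime)]
        if time_sensitive:
            # e.g. markup that displays "today's" date: the key changes
            # daily, so yesterday's cached HTML is already useless
            parts.append(time.strftime('%Y-%m-%d'))
        return hashlib.md5('|'.join(parts).encode()).hexdigest()

    # The "same" page rendered for two visitors with different
    # authorizations needs two separate cache entries:
    print(cache_key('PITS.PITS', 'guest',  'read', 1151100000))
    print(cache_key('PITS.PITS', 'editor', 'edit', 1151100000))

And that still ignores custom recipes, whose output can depend on
just about anything at all.)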

> Furthermore, 100 contemporary
> requests could also be generated by human users if your project is big
> enough.

Yes, but if a project is big enough to be generating that sort of load,
it's probably big enough to warrant more than a basic shared webhosting
account.

> So, my question is: How would PmWiki react in a similar situation? Is it
> able to handle more than 100 contemporary requests? I know, this depends
> also on the machine used, so let's say on an average hardware that you will
> have with an average shared webhosting. 

Well, first of all, there are very few shared webhosting environments that
can reasonably handle 100 simultaneous requests for dynamically generated
content.  At that level, many sites even have trouble with static content, 
as they try to spawn enough webserver processes to handle the load.  Also,
other factors start to come into play, such as disk latency, network
bandwidth, operating system tuning, etc.

Still, PmWiki does fairly well.  The pmwiki.org/pmichaud.com site receives
over 2 million hits per month and runs on a virtual private server
(i.e., sharing hardware with other VPS instances), and that VPS hosts
several other domains in addition to pmwiki.org/pmichaud.com.  Of the
2 million hits, at least 40% are explicitly for PmWiki pages
(as opposed to graphic images, CSS files, or other static content).
So, pmwiki.org (and PmWiki) sustains reasonably heavy traffic volumes.

Texas A&M University-Corpus Christi runs a wiki that currently has over
200,000 pages, and is heavily used for writing classes.  In such an
environment it's not at all rare for PmWiki to receive 50 or more
concurrent requests, and so far it's been able to handle the load
reasonably well (at least, none of the students or instructors have
indicated any dissatisfaction with response times).  Of course, that site
has a somewhat more powerful server than what a shared hosting account
would typically provide, but still it shows that it's possible for PmWiki
to handle such a load.

> Furthermore, as you all presumably
> have a PmWiki running, how are your own experiences with the 'Slurp' bot?

I already outlined my analysis of Slurp above.  Because of the heavy
toll that search engine spiders can take, PmWiki has a fair amount of
robot control built in.  First, it doesn't return any "non-browse" pages
to robots: if a robot follows a link to something like "?action=edit" or
"?action=diff", PmWiki quickly returns a "403 Forbidden" to the robot,
avoiding the cost of generating a full response to those requests.
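
The general idea, sketched here in Python purely for illustration (PmWiki
itself is PHP, and its actual patterns and settings differ), is just:
recognize a crawler by its user-agent and refuse any non-browse action
with a cheap 403 before doing any expensive rendering work.

    import re

    # Hypothetical values for illustration; real crawler lists are longer
    ROBOT_PATTERN = re.compile(r'Slurp|Googlebot|msnbot', re.I)
    BROWSE_ACTIONS = {'browse', 'rss', 'print'}

    def handle_request(user_agent, action='browse'):
        if ROBOT_PATTERN.search(user_agent or '') and action not in BROWSE_ACTIONS:
            # cheap early exit: no markup processing, no diff generation
            return 403, 'Forbidden'
        return 200, render_page(action)      # the normal (expensive) path

    def render_page(action):
        return '<html>...rendered %s view...</html>' % action

    # A crawler following an "?action=edit" link is turned away cheaply:
    print(handle_request('Mozilla/5.0 (Yahoo! Slurp)', 'edit'))
    print(handle_request('Mozilla/5.0 (ordinary browser)', 'edit'))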

I sent a message to Yahoo! asking for more details about Slurp and 
the fact that it seemed to be making excessive requests to my site, but
never received a response.  

Personally, I figure that if Yahoo! won't be a good Internet citizen
and show some respect for the costs its spiders impose on my site
(and the bandwidth that I have to pay for), then I don't really
care whether my sites' content appears in its search engine.

My response to Slurp has been to set robots.txt to give Slurp a
Crawl-delay of 60 seconds, simply to reduce the sheer volume of
requests.  60 seconds is a fairly high value for Crawl-delay -- 
most articles recommend values of 5 or 10.  Still, even with this 
high Crawl-delay value, Slurp has hit the pmwiki.org/pmichaud.com 
site 106,706 times so far in June (7% of all hits).  For comparison, 
Googlebot has 91,504 hits, and msnbot has 45,987 hits.
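
For reference, the corresponding robots.txt entry looks roughly like
this (the Crawl-delay value is in seconds):

    User-agent: Slurp
    Crawl-delay: 60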

In short, given what I've seen of Slurp's behavior on the sites I run, I 
think that webserver administrators are entirely justified in 
severely restricting or denying Slurp's access to the webserver.  
Perhaps if enough web administrators start complaining about
Slurp then Yahoo! will fix their broken bot.

Pm



