[pmwiki-users] Too many pmwiki.php processes

Patrick R. Michaud pmichaud at pobox.com
Wed Feb 24 17:21:07 CST 2016


If you haven't done so already, be sure to add include_once('scripts/robots.php') to your configuration file.  It does a lot of things to try to control robots flooding your website.  As you might guess, the pmwiki.org site has had to deal with robot floods as well (often thousands of requests per day), especially from poorly-behaved crawlers that don't obey the current conventions or that don't throttle themselves appropriately.

For example, here are pmwiki.org's robot-related settings in local/config.php:

  $RobotActions['diag'] = 1;
  $RobotPattern = '(?i:bot|crawler|spider)|Slurp|Teoma|ia_archiver|HTTrack|XML Sitemaps Generator|Yandex';
  $EnableRobotCloakActions = 1;
  
  ##  send an error to robots if the site is currently overloaded
  include_once('scripts/robots.php');
  if ($IsRobotAgent) {
    $avg = file_get_contents('/proc/loadavg');
    if ($avg > 5) {
      header("HTTP/1.1 500 Server Busy");
      print("<h1>Server too busy</h1>");
      exit();
    }
  }

The $RobotPattern setting is the regular expression used to identify robot crawlers -- any User-Agent that matches this pattern is classified as a robot, and the $RobotActions array lists the actions robots are still allowed to request (the line above adds 'diag' to that list for pmwiki.org).  If a robot requests an action that doesn't make sense for a robot -- such as a request to edit a page or perform a search -- then PmWiki cuts the connection as quickly as it can to reduce the load on the server.
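
Roughly speaking, the classification is just a regular-expression match of the User-Agent header against $RobotPattern.  The sketch below shows the idea; the real check lives in scripts/robots.php and may differ in detail:

  ## Rough sketch of how $IsRobotAgent gets set -- see scripts/robots.php
  ## for the real implementation.
  $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
  $IsRobotAgent = (bool)preg_match("!$RobotPattern!", $ua);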

Setting $EnableRobotCloakActions to true means that PmWiki will try to strip any ?action= parameters from the links in pages it generates for robots.  That way robots are less likely to follow up with requests for ?action= URLs, which would be useless to them anyway.
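
Conceptually the cloaking is just a rewrite of the generated HTML before it goes out to a robot.  The snippet below is only an illustration of the idea, not the actual implementation in scripts/robots.php; $html stands for the page output about to be sent:

  ## Illustration only: strip ?action=... query strings from links so that
  ## robots don't try to follow them.
  if ($IsRobotAgent && $EnableRobotCloakActions)
    $html = preg_replace('/href="([^"?]*)\?action=[^"]*"/', 'href="$1"', $html);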

The $IsRobotAgent variable is set to true if PmWiki believes it's currently being accessed by a robot.  On pmwiki.org, if we're being accessed by a robot we first check the current system load (from /proc/loadavg) and immediately return a "500 Server Busy" response to the robot if the load is above 5.  This basically tells the robot "busy now... try again later".
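
If /proc/loadavg isn't available on a given host (it's Linux-specific), PHP's sys_getloadavg() reports the same three load averages on most Unix-like systems and avoids relying on a loose string comparison of the raw file contents; a minimal variation of the check above:

  ## Same idea as above, using sys_getloadavg() instead of /proc/loadavg.
  ## It returns array(1-min, 5-min, 15-min) averages, or false if unavailable.
  if ($IsRobotAgent) {
    $load = sys_getloadavg();
    if ($load !== false && $load[0] > 5) {
      header("HTTP/1.1 500 Server Busy");
      print("<h1>Server too busy</h1>");
      exit();
    }
  }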

Next, make sure your skin is adding rel="nofollow" attributes to links where appropriate and a <meta name="robots" content="noindex,nofollow"> tag in the <head> section of the output document.  Lots of rogue robots don't honor these directives, but at least you'll catch the ones that do.
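
If a skin or setup doesn't already do this, the robots meta tag can be set from config.php via PmWiki's $MetaRobots variable.  A small sketch -- the $action test assumes you only want to keep non-browse views (edit forms, diffs, search results, and so on) out of indexes:

  ## Sketch: ask robots not to index or follow links on non-browse views.
  ## $MetaRobots controls the content of the <meta name="robots" ... /> tag.
  if ($action != 'browse') $MetaRobots = 'noindex,nofollow';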

And lastly, I've found that Yahoo! Slurp has been one of the worst offenders of them all in terms of not playing nice with spidering, indexing pages, and honoring directives such as robots.txt.  I ultimately decided to block them at the .htaccess level altogether.
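
For anyone wanting to do the same, a generic Apache 2.2-style .htaccess snippet would look something like the following (an illustration rather than pmwiki.org's actual rules; it assumes mod_setenvif and the classic Order/Deny access directives are available -- Apache 2.4 uses Require/RequireAll instead):

  # Refuse any request whose User-Agent contains "Slurp" (case-insensitive).
  BrowserMatchNoCase "Slurp" bad_bot
  Order Allow,Deny
  Allow from all
  Deny from env=bad_bot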

Hope this helps.  There probably needs to be better documentation or a cookbook page about this somewhere on PmWiki.org, if anyone is willing to draft one.

Pm



On Wed, Feb 24, 2016 at 07:23:53AM -0600, Rick Cook wrote:
> They were complaining about too many processes and claiming it was putting too much load on their server.
> 
> The PmWiki sites involved are very simple with no dynamic content.
> 
> Most of the "visitors" are robots of various varieties. The highest number of access_log entries in a single day so far this year across my whole account (~10 PmWiki sites and 2 WordPress sites) was about 9200. Typically, it was more like 2500 a day. Since this all started, I have put a more restrictive robots.txt file in all of the URL roots and added a block for 192.168 addresses to all of the appropriate .htaccess files (somehow, the non-routable address 192.168.151.0 was one of the more frequent IP addresses in my access_log files). I think it is running more like 1000 per day now.
> 
> They suggested several IP addresses to block. One was the dynamic IP assigned to my home connection and another was their site monitoring service. Most of the other addresses were attached to well known sites like Google.
> 
> 
> Rick Cook
> 
> > On Feb 24, 2016, at 05:59, ABClf <languefrancaise at gmail.com> wrote:
> > 
> > Why are they complaining: because of "too many" (quantitative) or
> > because of "too heavy" (your sites cannibalize the server)?
> > 
> > Maybe you use pagelists that are too heavy? What I mean is that X
> > processes doesn't tell us how CPU-hungry they are (printing out a simple
> > page vs. building a complex page from a multi-parameter pagelist). How
> > many users are visiting your sites in the busy hours?
> > 
> > In that case, if possible, you might be happy with the FastCache recipe.
> > That's a must-have for me as I use pagelists and PTVs a lot. It works
> > nicely (40 visitors in the busy hours).
> > I never delete all cached pages; I run a cron job which deletes the 300
> > oldest cached pages every day.
> > 
> > I have been in trouble with my host recently, when my pageindex was
> > lost for an unknown reason: it was hard to rebuild, and the process was
> > taking all the CPU for minutes. Thus they blocked my site for a few
> > hours. (Shared host, "premium option", 2 cores, 2 GB RAM, 20 GB SSD...)
> > 
> > My host didn't give me any useful information I could read to try to
> > understand what was going wrong; they just told me: run htop to see the
> > processes and CPU/memory usage.
> > 
> > Gilles.
> > 
> > 2016-02-14 11:13 GMT+01:00 Rick Cook <rick at rickcook.name>:
> >> All,
> >> 
> >> My hosting provider is complaining about "too many pmwiki.php processes" running from my PmWiki sites. By "too many", they meant more than 100. I have 10 or so sites active with this provider. With that many sites, having 100 or more pmwiki.php processes doesn't seem excessive.
> >> 
> >> Has anyone else had this type of complaint from their hosting service?
> >> 
> >> 
> >> Thanks,
> >> 
> >> Rick Cook


