search engine beware! Was: Re: [pmwiki-users] notice of current edit

Patrick R. Michaud pmichaud at pobox.com
Fri Apr 15 09:13:20 CDT 2005


On Fri, Apr 15, 2005 at 09:18:22AM -0400, Radu wrote:
> So how about this:
> Since no sane individual can see two different pages in the same second, 
> [...]

Hmm, I frequently request multiple pages within 1-2 seconds of
each other, so I guess I now have yet another reason to question my
sanity.  :-) :-)

But seriously, I really do request multiple pages within 1-2 seconds
of each other.  To review the pages on pmwiki.org I just go to 
Main.AllRecentChanges, and then rapidly click on the links to
pages I want to review while holding down the ctrl key in Firefox.
This causes each page to open in its own tab behind the current
one, and I can begin reviewing pages even before all of them
have finished loading.  When I'm done with a page, I just hit
the tab's close button (or ctrl+F4) and the next page is instantly
there for me to review -- no back button, and no trying to remember
which pages I've already looked at.  So, rapid page requests on my
sites aren't at all an indication of spider/bot activity...

> not to mention edit them, there is a way to differentiate between search 
> engines and actual wiki authors: log the timestamp of the previous access 
> from each IP. If it's smaller than a settable interval (default 2s), then 
> do not honor requests for edit. For an even stronger deterrent, to save 
> processor time when the wiki is supposed to be hidden, we could also add an 
> $Enable switch to keep from honoring ANY request to fast-moving IPs.

Well, I don't know if reality bears this out.  First, because of
network latency issues, it's entirely possible for several requests
to be "held" by the OS or Apache and then released to PmWiki in rapid
succession, so that it appears to PmWiki that the requests came in
all at once when they were really just held up in a queue somewhere.

Second, while there are indeed search engines that can flood a site
with requests in rapid succession, most requests end up being throttled
in one form or another.  For example, I've just reviewed my logs and 
I can't find any cases where Googlebot issued two or more requests 
within 5-6 seconds of each other, and most are 30 or more seconds apart.
(And I haven't found *any* cases where it did an ?action= link 
immediately after a browse.)
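
For anyone who wants to eyeball their own logs the same way, something
like this is usually enough -- it assumes the common/combined log
format, with the timestamp in field 4 and the requested path in
field 7:

   grep -i googlebot access_log | awk '{ print $4, $7 }' | less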

More generally, I just did 

   grep -i 'bot\|spider' access_log | grep wiki | sort | less

and the vast majority of bot/spider requests are at least 30 seconds
apart, and I couldn't find any that issued ?action=edit within 5 seconds
of their previous access.  (And I suspect that hitting edit within 5
seconds of the latest browse is not outside of reasonable human
behavior.)
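
If anyone wants to automate that check rather than eyeball the
timestamps, something along these lines should work -- an untested
sketch that assumes GNU awk (for mktime) and the usual common/combined
log layout, with the client IP in field 1, the timestamp in field 4,
and the request in field 7.  (No sort this time, since the gap
calculation needs the lines in log order.)

   grep -i 'bot\|spider' access_log | grep wiki | gawk '
     BEGIN {
       split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", m)
       for (i = 1; i <= 12; i++) mon[m[i]] = i
     }
     {
       # $4 looks like [15/Apr/2005:09:13:20 -- convert to epoch seconds
       split(substr($4, 2), t, "[/:]")
       now = mktime(t[3] " " mon[t[2]] " " t[1] " " t[4] " " t[5] " " t[6])

       # flag edits arriving within 5s of the previous hit from the same IP
       if ($7 ~ /action=edit/ && ($1 in prev) && now - prev[$1] < 5)
         print (now - prev[$1]) "s gap: " $0
       prev[$1] = now
     }'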

Lastly, there's the question of the server overhead involved in
providing synchronized access to the log file(s).  This requires one
or more of the following (a rough sketch follows the list):
   - locking of the log file(s) among multiple wiki processes
   - scanning the log files to determine when to disallow a request
   - large numbers of log files (e.g., one log file per IP address)
   - scripts to clean/remove outdated files
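
Just to make those costs concrete, here's roughly what the per-IP
bookkeeping involves.  This is purely a hypothetical sketch, done as a
CGI wrapper rather than inside PmWiki itself (the paths, the 2-second
window, and the real-wiki.cgi handoff are all made up), and it assumes
$REMOTE_ADDR from CGI, GNU date, and util-linux flock:

   #!/bin/sh
   # hypothetical per-IP throttle wrapped around a CGI wiki
   ipdir=/var/tmp/wiki-lastseen
   ipfile="$ipdir/${REMOTE_ADDR:-unknown}"
   mkdir -p "$ipdir"

   exec 9>>"$ipfile.lock"
   flock 9                            # serialize wiki processes for this IP

   now=$(date +%s)
   last=$(cat "$ipfile" 2>/dev/null); last=${last:-0}
   echo "$now" > "$ipfile"            # record this hit (refused hits also
                                      # reset the window)

   if [ $((now - last)) -lt 2 ]; then # a second hit inside the window:
     echo "Status: 403 Forbidden"     # refuse it before the wiki ever runs
     echo "Content-type: text/plain"
     echo
     echo "Too many requests."
     exit 0
   fi

   flock -u 9                         # release the lock before handing off
   exec /path/to/real-wiki.cgi        # ...to the real wiki script

   # ...and a separate cron job still has to sweep the stale per-IP
   # files, e.g.:  find /var/tmp/wiki-lastseen -type f -mmin +60 -delete

That's two extra files per client IP, a lock acquired on every single
request, plus a cleanup job -- which is roughly the overhead I mean.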

Somehow I don't see that this approach would be all that effective.
At least I know it wouldn't bring much benefit to the sites that
I run.

Pm
