[pmwiki-users] Pageindex and distributed documentation (Was Strange pagelist behaviour)

Thu Jun 15 12:00:55 CDT 2006

On Thu, Jun 15, 2006 at 12:42:06PM -0400, Neil Herber wrote:
> At 2006-06-15  10:46 AM -0500, Patrick R. Michaud is rumored to have said:
> >But I'd still prefer to come up with some way for pageindex
> >to quickly detect when a page might have been changed outside
> >the normal wiki editing process (without having to actually
> >rescan the page on each check).
> 
> I am not sure how practical or reliable this might be, but wouldn't a 
> file (page) changed outside of the normal wiki processes have a 
> different file system date? If pageindex stored the file system date 
> as part of its index, it could then detect changes made when it 
> wasn't looking. My big assumption here is that getting the file 
> system date is much less costly than rescanning the page.

Good idea -- I had been thinking of something along these lines as
well.  First, you're entirely correct that getting the file system
date ought to be much less costly than rescanning the entire page.

And, as it turns out, .pageindex already stores a timestamp for
each entry in the index.  It's the time the index entry was generated
rather than the timestamp of the page file, but perhaps that will
be good enough for our purposes.  If the timestamp of a page file 
is newer than the timestamp of its index entry, we probably
ought to re-index the page.
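
A minimal sketch of that comparison, written in Python rather than
PmWiki's PHP. The function name and arguments are hypothetical and do
not reflect the actual .pageindex format; the sketch assumes only that
each index entry records the time it was generated.

```python
import os

def needs_reindex(page_path, index_entry_time):
    """Return True if the page file looks newer than its index entry.

    index_entry_time is the time (seconds since the epoch) at which the
    index entry was generated, not the page file's own timestamp.
    """
    try:
        # One stat() call per page: cheaper than rescanning the page,
        # but still a filesystem request.
        return os.stat(page_path).st_mtime > index_entry_time
    except FileNotFoundError:
        # Assumption: a page removed outside the wiki also makes its
        # index entry stale.
        return True
```

Because the entry time is recorded when the page is indexed, a page
edited through the wiki and re-indexed afterwards would normally
compare as up to date.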

So far I've avoided going down this route because even though
getting the file system date is less costly than rescanning
the page, it does still incur an expense.  Earlier this year
when pmwiki.org was having performance issues I did a lot of
testing, and one of the things I concluded was that the
filesystem on my virtual private server had a high degree of
latency and could quickly become a bottleneck if there
are a lot of requests to the filesystem -- even simple requests
such as "does file XYZ exist?"  (I suspect the files may
be held on a remote attached storage device of some sort.)

So, I'm hoping to avoid the expense of checking file
timestamps on *all* of the page files with each search
request.  It might be sufficient to check only those
pages that the index would otherwise exclude from the
scan, but even this would tend to be a majority of the
existing pages.

Or another possibility could be to simply check the
timestamps of those pages that are being scanned anyway,
and then invalidate the entire index if *any* 
inconsistency in the timestamps or index is detected.
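
A rough sketch of this last option, again in Python with hypothetical
names. It assumes the index can be treated as a mapping from page path
to the time its entry was generated, and that clearing the mapping
stands in for invalidating the index. Pages with no entry yet are
skipped here, which is a choice of the sketch, not something stated
above.

```python
import os

def check_scanned_pages(scanned_paths, index_times, mtime=os.path.getmtime):
    """Spot-check only the pages a search is reading anyway.

    index_times maps page path -> time its index entry was generated.
    If any scanned page is newer than its entry, the whole index is
    cleared and False is returned; otherwise True.
    """
    for path in scanned_paths:
        entry_time = index_times.get(path)
        if entry_time is not None and mtime(path) > entry_time:
            # A single inconsistency means the index cannot be trusted.
            index_times.clear()
            return False
    return True
```

The timestamp checks are limited to files the search opens in any
case, so the index-excluded pages cost no extra filesystem requests;
the trade-off is that a change to a page that is never scanned goes
unnoticed until some other inconsistency turns up.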

Pm