[pmwiki-users] pagelist performance analysis

Mon Apr 4 18:00:30 CDT 2005

On Mon, Apr 04, 2005 at 06:32:47PM -0400, Martin Fick wrote:
> Since I have an interest in getting categories to work
> faster, I have done some analysis of (:pagelist:).
> [...]

I've already got quite a few speedups planned for pagelist --
I just haven't had a chance to implement them yet.  Two of them are:

1.  Categories currently work by scanning the entire markup text
for matches to the category name -- this isn't necessary, as only
the targets= attribute needs to be scanned, which is a *lot* shorter.
This should result in a big speedup.

2.  Currently, reading a page file reads the page's entire history,
even when it's not needed.  I'm going to be reorganizing the page storage
so that this can be skipped, again resulting in a big speedup.
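
As a rough illustration of both points, here is a minimal sketch.  It
assumes a page file holds one "key=value" attribute per line, that
targets= is a comma-separated list of page names, and that categories
live in a "Category" group.  PageTargets() and InCategory() are
hypothetical names, not existing pmwiki.php functions.

```php
<?php
// Return the link targets recorded in a page file.
// Assumption: one "key=value" attribute per line, with a line such as
//   targets=Category.Fruit,Main.HomePage
function PageTargets($pagefile) {
    $fp = @fopen($pagefile, 'r');
    if (!$fp) return array();
    $targets = array();
    while (($line = fgets($fp)) !== false) {
        if (strncmp($line, 'targets=', 8) == 0) {
            $list = explode(',', trim(substr($line, 8)));
            $targets = array_values(array_filter($list, 'strlen'));
            break;   // stop reading; later lines are never loaded
        }
    }
    fclose($fp);
    return $targets;
}

// True if the page links to the given category.
// Assumption: categories are pages in the "Category" group.
function InCategory($pagefile, $category) {
    return in_array("Category.$category", PageTargets($pagefile));
}
```

The early break only saves work if targets= appears before the history
lines in the file, which is itself an assumption about the storage
layout.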

>  1) The FmtPageName function, which the comments say is
>     used to:
>  [...]
>    Simple hack (pmwiki.php):
>       function FmtPageName($fmt, $pagename) {
> 	global $FarmD;
> 	if ($fmt == 'wiki.d/$FullName') return "wiki.d/$pagename";
> 	if ($fmt == '$FarmD/wikilib.d/$FullName') return "$FarmD/wikilib.d/$pagename";
> 	return FmtPageNameO($fmt, $pagename);
>       }

This will only work if $pagename is already in 'Group.SomePage' format.
Sometimes $pagename will come in as 'Group/SomePage', in which case the
hack above returns 'wiki.d/Group/SomePage', which is not where the page
is stored.  YMMV.
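
One way to guard such a shortcut is to normalize the name first and
bail out on anything unexpected.  The following is only a sketch:
FastPageFile() is a hypothetical helper, and it assumes the two
separators are the only difference between the incoming forms.

```php
<?php
// Hypothetical helper, not part of pmwiki.php.
// Returns the page file path for the one format it knows about,
// or null so the caller can fall back to the normal FmtPageName().
function FastPageFile($fmt, $pagename) {
    // Accept either separator, then require exactly Group.Name.
    $name = str_replace('/', '.', $pagename);
    if (!preg_match('/^[^.]+\.[^.]+$/', $name)) return null;
    if ($fmt == 'wiki.d/$FullName') return "wiki.d/$name";
    return null;
}
```

Names without a group, or with more than one separator, fall through
to the slow path rather than producing a bad file name.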

>  2) Reading many files in PHP.  I made many hacks with page
>     content caching.  On pages with multiple pagelists this
>     is a great improvement.  The problem is that, of
>     course, this only proves that it's slow, it doesn't
>     help speed up the simple (probably most important case)
>     of one pagelist.

There's another problem -- PHP often runs in limited memory
environments (8 megabytes is common).  Caching page content in
memory can easily hit this limit.

>     To speed this up, I resorted to brute force: grep.

Grep is useful, but it assumes that 
   (1) grep is available (not true in many Windows environments)
   (2) grep can be executed (not true in many safe_mode environments)
   (3) the list of files to be grepped fits within the character
       limits for shell commands
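
For (3), the file list would have to be split into batches.  A minimal
sketch of that, where GrepBatches() and the $maxChars limit are
hypothetical and the real limit depends on the platform:

```php
<?php
// Hypothetical helper: split $files into batches whose shell-quoted
// names total at most $maxChars characters, so each batch can go on
// one command line.
function GrepBatches($files, $maxChars) {
    $batches = array();
    $cur = array();
    $len = 0;
    foreach ($files as $f) {
        $n = strlen(escapeshellarg($f)) + 1;   // +1 for the separating space
        if ($cur && $len + $n > $maxChars) {
            $batches[] = $cur;                 // current batch is full
            $cur = array();
            $len = 0;
        }
        $cur[] = $f;
        $len += $n;
    }
    if ($cur) $batches[] = $cur;
    return $batches;
}
```

A single name longer than $maxChars still ends up in a batch of its
own.  Batching does nothing for (1) or (2); those still need a pure
PHP fallback.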

The *big* speed increase for categories would be to have the targets=
values for each page file stored in a cross-reference index file
somewhere.  The problem with that is coherence -- it's too easy for
the index to become de-synchronized with the pages in the site.

OTOH, perhaps we could opportunistically rebuild the index once, rather
than having each pagelist query rebuild it dynamically, as happens now.
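
To make the idea concrete, here is a sketch of such an index and a
simple staleness check.  BuildTargetIndex() and IndexIsStale() are
hypothetical names, and the input is assumed to be an array mapping
each page name to its list of targets.

```php
<?php
// Invert "page => list of targets" into "target => list of pages",
// so a category lookup becomes a single array access.
function BuildTargetIndex($pageTargets) {
    $index = array();
    foreach ($pageTargets as $page => $targets) {
        foreach ($targets as $t) {
            $index[$t][] = $page;
        }
    }
    return $index;
}

// The index is stale if any page was modified after it was built.
// $indexTime and $pageTimes are timestamps, e.g. from filemtime().
function IndexIsStale($indexTime, $pageTimes) {
    foreach ($pageTimes as $t) {
        if ($t > $indexTime) return true;
    }
    return false;
}
```

This check only notices changed pages; a deleted page would leave a
stale entry behind, which is exactly the kind of coherence problem
described above.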

At any rate, pagelist speedups *are* in the works, but since there
are a lot of things involving pagelists that have to be reworked
(including sorting of results, categories, trails) it's going to
happen all at once.  As I said in a previous message -- don't rely
too heavily on the current implementation internals of (:pagelist:), 
as they're very likely to change in the near future.

Pm