[pmwiki-users] TextExtract (Search recipe) update

Hans design5 at softflow.co.uk
Tue Sep 15 04:12:36 CDT 2009


Friday, September 11, 2009, 1:09:42 PM, Hans wrote:

> For TextExtract I cannot just use PmWiki's search engine,
> because we need to extract text. But thanks to your suggestion I was
> inspired to look at the handling of search terms again, and will
> incorporate the way PmWiki's search handles search terms, so we can
> have input like
>   'abc xyz' => output with 'abc' AND 'xyz' in the page;
>   '"abc def" xyz' => output with 'abc def' AND 'xyz' in the page;
>   'abc -xyz' => output with 'abc' but NOT 'xyz' in the page;
>   'abc|xyz' => output with 'abc' OR 'xyz' in the page;

Now available in the latest release.
http://www.pmwiki.org/wiki/Cookbook/TextExtract
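
The term syntax quoted above can be sketched roughly as follows. This is only an illustration in Python, not the recipe's actual PHP code; the function names parse_terms and page_matches are hypothetical, and case-insensitive substring matching is an assumption.

```python
import re

def parse_terms(query):
    """Split a query into (required, excluded).

    required is a list of groups; every group must match (AND),
    and any alternative inside a group may match (OR).
    excluded is a flat list of terms that must not occur (NOT).
    """
    required, excluded = [], []
    # A token is an optionally negated "quoted phrase" or a run of non-space characters.
    for token in re.findall(r'-?"[^"]*"|\S+', query):
        negate = token.startswith("-") and len(token) > 1
        if negate:
            token = token[1:]
        if len(token) >= 2 and token.startswith('"') and token.endswith('"'):
            alternatives = [token[1:-1]]      # quoted phrase, kept whole
        else:
            alternatives = token.split("|")   # 'abc|xyz' means abc OR xyz
        alternatives = [a for a in alternatives if a]
        if not alternatives:
            continue
        if negate:
            excluded.extend(alternatives)
        else:
            required.append(alternatives)
    return required, excluded

def page_matches(text, query):
    """Return True if the page text satisfies the query (assumed case-insensitive)."""
    required, excluded = parse_terms(query)
    haystack = text.lower()
    if any(term.lower() in haystack for term in excluded):
        return False
    return all(any(alt.lower() in haystack for alt in group) for group in required)
```

For example, page_matches(text, 'abc -xyz') is true only for text that contains 'abc' and does not contain 'xyz'.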

I also added some template variables for use in the parameters
header=, footer= and phead=.
For instance, a header with a custom title and the search time:
   header="%rfloat%{$$time}%%'''Listing'''"

I split regular expression search from standard search, to allow
easier term input, and added a checkbox for regular expression search
to the search form.
I also added a 'Match whole words' checkbox for whole-word searches.
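
As a rough illustration of how the two checkboxes could change the way a term is matched, here is a Python sketch. It is not the recipe's code; build_pattern and its parameters are hypothetical names, and ignoring case is an assumption.

```python
import re

def build_pattern(term, use_regex=False, whole_word=False):
    """Compile one search term into a pattern.

    use_regex=False treats the term as literal text (standard search);
    use_regex=True passes the term through as a regular expression.
    whole_word=True only matches the term at word boundaries.
    """
    source = term if use_regex else re.escape(term)
    if whole_word:
        source = r"\b(?:" + source + r")\b"
    return re.compile(source, re.IGNORECASE)  # case-insensitivity is assumed
```

With this sketch a standard search for 'a.c' only finds the literal text 'a.c', while the same input with the regular expression option also finds 'abc'.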

A note on efficiency:
TextExtract with its built-in pagelist function runs faster than
PmWiki's pagelist or the MakePageList() function. The main reason is
that PmWiki's pagelist process opens every page to check whether the
user is authorised to see it, because it does not want to output any
non-authorised pages, for instance read-protected pages. Opening all
these files can be quite time-consuming.
TextExtract, on the other hand, constructs a pagelist that includes
even read-protected pages; authorisation is not checked at this stage
of the process. Only later, when each page on the source list is
opened, is authorisation checked, before text lines are extracted and
processed. So far fewer pages need to be opened, which makes for a
faster process. That is the main reason I did not use MakePageList()
as a source pagelist generator.
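
The two-stage idea can be sketched like this in Python. It is a simplified model, not the recipe's PHP code: build_source_list, extract_lines and the open_page callback are hypothetical names, and filtering the source list by a name pattern is an assumption about how the list gets narrowed.

```python
from fnmatch import fnmatchcase

def build_source_list(all_names, name_pattern):
    """Stage 1: select pages by name only.

    No page is opened and no authorisation is checked here, so
    read-protected pages may still be on the list.
    """
    return [name for name in all_names if fnmatchcase(name, name_pattern)]

def extract_lines(source_list, open_page, term):
    """Stage 2: open each listed page once and extract matching lines.

    open_page(name) is assumed to return the page text, or None when
    the user is not authorised to read the page.
    """
    results = {}
    for name in source_list:
        text = open_page(name)
        if text is None:
            continue  # not authorised: skip without extracting anything
        hits = [line for line in text.splitlines() if term in line]
        if hits:
            results[name] = hits
    return results
```

In this model only the pages on the narrowed source list are ever opened, and the authorisation check happens as part of that single open.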

Still, it remains possible to use the PmWiki searchbox with a
fmt=#extract option, which will use PmWiki's pagelist functions
and TextExtract's formatting functions. This is useful if you need to
pass pagelist parameters that TextExtract does not understand.

  ~Hans



