[pmwiki-users] TextExtract (Search recipe) update

Mon Sep 7 14:59:50 CDT 2009

Monday, September 7, 2009, 8:20:17 PM, The Editor wrote:

> P.S. Kudos again. I like all the options you have been adding. Lots
> of good ideas...

Thanks for your support, Dan!

Petko, would it be possible to do some experimental testing
of the TextExtract search box on pmwiki.org?
You could enable the script for Cookbook.TextExtract
or for some page in the Test group, and we can set up one search form
to search, for instance, pages in group Cookbook,
another one for group PmWiki,
and/or one searching both Cookbook and PmWiki.
I am very curious how it will perform, and whether it will perhaps prove
to be helpful for searching with results returned in context.

Monday, September 7, 2009, 8:20:17 PM, The Editor wrote:

> Having said that, I've been rethinking how searches are done, and that
> it might be much smarter to instead try to analyze the search
> parameter and generate a very smart pattern, something like
> /(term1.*term2|term2.*term1)/ for && and /(term1|term2)/ for ||. In
> which case you could scan the index/page text just one time.

It looks rather complicated and, as I said, it will mess up the
highlighting. I think I would rather stack two or three
search terms as separate patterns, (abc xyz) => /abc/ /xyz/,
then in each source page search for the first and, if found, search
for the second, etc. This is extensible using an array.
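
To make the stacked idea concrete, here is a minimal sketch. It is written
in Python rather than the recipe's PHP, purely for illustration, and the
function name page_matches_all and the sample pages are hypothetical, not
part of TextExtract. Each term becomes its own pattern, and a page is
dropped as soon as one pattern fails to match:

```python
import re

def page_matches_all(text, terms):
    """Return True only if every term is found in text (logical AND).

    Each term is compiled into its own case-insensitive pattern. The loop
    returns at the first missing term, so later patterns are never run on
    a page that has already failed.
    """
    patterns = [re.compile(re.escape(t), re.IGNORECASE) for t in terms]
    for pattern in patterns:
        if pattern.search(text) is None:
            return False
    return True

# Hypothetical page store: page name -> page text
pages = {
    "Cookbook.One": "abc appears here, and so does xyz.",
    "Cookbook.Two": "only abc appears here.",
}

hits = [name for name, text in pages.items()
        if page_matches_all(text, ["abc", "xyz"])]
print(hits)
```

Because the terms live in a list, going from two terms to three needs no
new code, only a longer list.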

The page text is already read, and doing another preg_match with the
second pattern is not such a big deal. If the second match is found,
we then need to run the routines for code cleaning at least once, and
for highlighting twice (once for each pattern), and so on for more than two.
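
A rough sketch of highlighting once per pattern, again in Python for
illustration only. The function name highlight_terms and the [[ ]] markers
are assumptions made for this example; the recipe's real highlighting
markup is not shown here:

```python
import re

def highlight_terms(text, terms, before="[[", after="]]"):
    """Wrap every occurrence of each term in marker strings.

    One substitution pass is made per term, so two search terms mean
    two passes over the same text.
    """
    for term in terms:
        pattern = re.compile(re.escape(term), re.IGNORECASE)
        # A function replacement keeps the matched text's original case
        # and avoids backslash handling in the replacement string.
        text = pattern.sub(lambda m: before + m.group(0) + after, text)
    return text

result = highlight_terms("abc then xyz then ABC", ["abc", "xyz"])
print(result)
```

One caveat of this simple version: if one term overlaps another (say 'abc'
and 'bc'), or a term could match the marker text itself, a later pass will
nest markers inside earlier ones, so a real implementation would need to
guard against that.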

It requires some big code changes, and I am not sure whether one needs to
set a limit on the number of terms combined with AND.
I imagine most often one would not go beyond three terms.

Then there is the question: if I look for 'abc' AND 'xyz', do I expect
the search to find them a) in the same line, b) in the same
paragraph, or c) on the same page (referring to TextExtract's three ways
of defining the "unit" of text to return)?
I guess it will mostly be expected to find the terms on the same page.
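
The difference between the three scopes can be shown with a small sketch
(Python, illustration only). The unit labels "line", "para" and "page" and
both function names are hypothetical and are not claimed to be the
recipe's actual parameter values:

```python
import re

def split_units(text, unit):
    """Split page text into the units within which all terms must co-occur."""
    if unit == "line":
        return text.splitlines()
    if unit == "para":
        # Paragraphs are assumed to be separated by one or more blank lines.
        return re.split(r"\n\s*\n", text)
    if unit == "page":
        return [text]
    raise ValueError("unknown unit: " + unit)

def units_with_all_terms(text, terms, unit):
    """Return the units that contain every term (case-insensitive AND)."""
    lowered = [t.lower() for t in terms]
    return [u for u in split_units(text, unit)
            if all(t in u.lower() for t in lowered)]

page = "abc is on line one\nxyz is on line two\n\nunrelated paragraph"

by_line = units_with_all_terms(page, ["abc", "xyz"], "line")
by_para = units_with_all_terms(page, ["abc", "xyz"], "para")
by_page = units_with_all_terms(page, ["abc", "xyz"], "page")
print(len(by_line), len(by_para), len(by_page))
```

With this sample text the two terms never share a line, but they do share
a paragraph and the page, so the narrower the unit, the fewer AND matches
one should expect.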

  ~Hans