[pmwiki-users] TextExtract (Search recipe) update

Mon Sep 7 14:20:17 CDT 2009

On Tue, Sep 1, 2009 at 9:51 AM, Hans<design5 at softflow.co.uk> wrote:
> Tuesday, September 1, 2009, 1:20:22 PM, The Editor wrote:
>
>> Nice recipe Hans!  You have thought of many nice new features. Kudos
>
> Thanks, Dan!
>
> One thing I would love to add sometime, but don't know how yet,
> because it is complex, is:
> How to do a proper AND search:
> return lines or paras with both 'abc' AND 'xyz' terms, but not any
> with only one of them.
>
> At present entering 'abc xyz' is the same as entering '"abc" "xyz"',
> and only the complete string will be matched.
> I could enter 'abc.*xyz' but that does not match 'xyz.*abc'
> I could enter '(abc.*xyz)|(xyz.*abc)', which will get both abc and
> xyz together. But you see how complicated the input gets, and the
> highlighting will fail.
>
> The thing to do would be recursive preg_matching on individual search
> patterns like first find 'abc', then check for 'xyz', etc.
> The highlighting will be still a nightmare to program.

I don't know exactly how to do this either, Hans. In BoltWire you can
do booleans in the search function if you have something like text1
&&/|| text2. The code splits that into two searches, scans the index
for each part, then intersects (for &&) or merges (for ||) the
resulting arrays. At present our find function, which is much closer
to this plugin, does not have boolean capabilities, but I suspect you
would have to do it the same way. That is, for each page you are
scanning, run a more complex routine at that point in the foreach loop
to handle the boolean operators, using a couple of scans and then
merging/intersecting the results.
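
As a rough illustration of the split-and-merge approach described
above, here is a minimal sketch. It is written in Python rather than
the PHP that BoltWire and PmWiki use, and the names find_pages,
boolean_search and the in-memory index dict are hypothetical, not taken
from either project. It assumes a query holds only one kind of operator.

```python
import re

def find_pages(index, term):
    """Return the set of page names whose text contains `term`."""
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    return {name for name, text in index.items() if pattern.search(text)}

def boolean_search(index, query):
    """Run one scan per term, then intersect (&&) or merge (||) the results.

    `index` is a hypothetical dict mapping page name -> page text.
    """
    for op, combine in (("&&", set.intersection), ("||", set.union)):
        if op in query:
            # Drop empty pieces so a trailing operator does not match everything.
            terms = [t.strip() for t in query.split(op) if t.strip()]
            if not terms:
                return set()
            return combine(*(find_pages(index, t) for t in terms))
    # No operator: plain single-term search.
    return find_pages(index, query.strip())
```

With an index of three pages, one containing both terms and two
containing one each, boolean_search(index, "abc && xyz") returns only
the first page, while "abc || xyz" returns all three.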

Of course, to get text1 text2 to work, you would first have to do
some manipulation to convert it to text1 && text2.
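
That conversion step might look something like the sketch below
(Python, for illustration only). The function name normalize_query is
hypothetical, and treating a double-quoted phrase as a single term is
an assumption, not something either recipe is known to do.

```python
import re

def normalize_query(raw):
    """Turn 'text1 text2' into 'text1 && text2'.

    Queries that already contain && or || are returned unchanged.
    Assumption: a double-quoted phrase counts as one term.
    """
    if "&&" in raw or "||" in raw:
        return raw.strip()
    # Match either a quoted phrase or a run of non-space characters.
    tokens = re.findall(r'"[^"]*"|\S+', raw)
    terms = [t.strip('"') for t in tokens]
    return " && ".join(t for t in terms if t)
```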

Having said that, I've been rethinking how searches are done, and it
might be much smarter to instead analyze the search parameter and
generate a very smart pattern, something like
/(term1.*term2|term2.*term1)/ for && and /(term1|term2)/ for ||. In
that case you could scan the index/page text just once. I haven't
looked at how Pm implemented booleans, but it might be instructive. If
you find anything, please pass the info my way. It's a back-burner
project, but I'm very interested.
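
A minimal sketch of that pattern-generation idea, again in Python
purely for illustration (build_pattern is a hypothetical name, and this
is not how PmWiki or BoltWire actually do it):

```python
import re
from itertools import permutations

def build_pattern(terms, op):
    """Build one regex covering all terms, so the text is scanned once.

    '&&' -> (term1.*term2|term2.*term1), '||' -> (term1|term2).
    '.' does not cross newlines here, so '&&' means "on the same line".
    """
    escaped = [re.escape(t) for t in terms]
    if op == "||":
        body = "|".join(escaped)
    elif op == "&&":
        # One alternative per ordering of the terms.
        body = "|".join(".*".join(order) for order in permutations(escaped))
    else:
        raise ValueError("unsupported operator: " + op)
    return re.compile("(" + body + ")", re.IGNORECASE)
```

Note that the && form needs one alternative per ordering, so the
pattern grows factorially with the number of terms. For more than
three or four terms, a lookahead per term, such as
^(?=.*term1)(?=.*term2), is a more compact way to express the same
test.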

Cheers,
Dan

P.S. Kudos again. I like all the options you have been adding. Lots
of good ideas...