[pmwiki-users] Faster searches

Thu Apr 14 03:56:04 CDT 2005

Patrick R. Michaud wrote:

> I did think about doing it the way you describe, where a markup-specific 
> function calls something to get the set of matching pages and then 
> formats the output, but at the time I decided it was too inefficient 
> (and searches are already too slow anyway) and less flexible than the 
> one I chose.

In response to the "searches are already too slow anyway" bit:

I think there's an upgrade path available. If a wiki becomes too large 
for searching, chances are that it's going to be on a "professional" 
webhoster with shell access and a PHP installation that allows calling 
external programs.

For these, it would be great if PmWiki had a way to hook up external 
full-text indexing services.
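
To make the idea a bit more concrete, here is a rough sketch of what such a hook could look like: after a page is saved, the wiki hands the page name and text to an external program. It's in Python only for brevity (PmWiki itself is PHP), and the function name, the command line and the "page text on stdin" convention are all assumptions of mine, not anything PmWiki or any of the indexers below actually define.

```python
import subprocess
import sys
from typing import Sequence


def run_indexer(command: Sequence[str], page_name: str, page_text: str,
                timeout: float = 10.0) -> bool:
    """Hand one saved page to an external indexer (hypothetical convention:
    page name as last argument, page text on stdin). Returns True on success."""
    try:
        result = subprocess.run(
            list(command) + [page_name],
            input=page_text.encode("utf-8"),
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Indexer missing, not executable, or too slow: the page save
        # itself should not fail because of that.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    # Stand-in "indexer" that just consumes stdin and exits successfully.
    fake_indexer = [sys.executable, "-c", "import sys; sys.stdin.read()"]
    print(run_indexer(fake_indexer, "Main.HomePage", "Welcome to the wiki"))
```

The point of returning a plain success flag is that a broken or missing indexer degrades search but never blocks editing.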

Here is some market research on open source indexing libs/services
(+ is a plus, - is a minus, o is a neutral remark):

htdig (http://www.htdig.org/)
o Needs HTTP access (can use HTTP basic auth to access pages)
o Meta data can be used for result ranking
- Only ISO-Latin-1 character set supported.
- Incremental indexing is designed for batched nightly runs.
   (Incremental indexing after every page save would be better.)
- Insists on presenting its own HTML pages.
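
What I mean by incremental indexing after every page save can be sketched like this: a toy in-memory inverted index that only touches the words that changed in the saved page. This is not how htdig or any other engine listed here works internally; the class and method names are made up for illustration.

```python
import re
from collections import defaultdict
from typing import Dict, Optional, Set


class IncrementalIndex:
    """Toy inverted index that is updated one page at a time."""

    def __init__(self) -> None:
        self.postings: Dict[str, Set[str]] = defaultdict(set)  # word -> pages
        self.page_words: Dict[str, Set[str]] = {}               # page -> words

    @staticmethod
    def tokenize(text: str) -> Set[str]:
        return set(re.findall(r"\w+", text.lower()))

    def update(self, page: str, text: str) -> None:
        """Re-index a single page right after it has been saved."""
        new_words = self.tokenize(text)
        old_words = self.page_words.get(page, set())
        for word in old_words - new_words:      # words that disappeared
            self.postings[word].discard(page)
            if not self.postings[word]:
                del self.postings[word]
        for word in new_words - old_words:      # words that are new
            self.postings[word].add(page)
        self.page_words[page] = new_words

    def search(self, query: str) -> Set[str]:
        """Return the pages that contain every word of the query."""
        result: Optional[Set[str]] = None
        for word in self.tokenize(query):
            pages = self.postings.get(word, set())
            result = set(pages) if result is None else result & pages
        return result if result is not None else set()
```

The cost of a save is proportional to the size of the edit, not to the size of the wiki, which is the whole reason to prefer this over a nightly batch run.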

mifluz (http://www.gnu.org/software/mifluz/)
The library within htdig.
- needs 50% of raw text size for indexes (30% is typical)
- alpha status
+ can run incrementally (no periodic indexing runs)
   (that's possibly contradictory to htdig properties...)
- Raw library, needs wrapper.

Lucene (http://lucene.apache.org/)
o Java-based.
+ can index incrementally
- Raw library, needs wrapper.
+ Can index "document fields" (i.e. meta data).

Egothor (http://www.egothor.org/)
o Java-based.
+ Available as file-configured stand-alone application.
+ Word stemmers for 11 European languages. (No mean feat, that...)
   Building your own stemmer for a multi-language installation is
   possible, too (stemming quality degrades though, but that's
   unavoidable I think)
- No incremental indexing

Swish-e (http://swish-e.org/)
+ Stand-alone application, configured by file or command line.
o Can index incrementally (in a limited fashion).
- No meta-data indexing. (Properties can be stored along with text,
   but properties aren't indexed - or so I understood.)
- Limited character set support (no UTF-8 at this time)

Xapian (http://www.xapian.org/)
o Written in C++.
+ Accessible from PHP module.
+ Probabilistic method - can rank results by relevance and do
   other "soft processing".
+ Stemmers for 12 European languages available
o (unsure about stemming on multi-language sites)
+ Incremental indexing.
- No meta-data indexing.
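
For those who haven't met probabilistic ranking: below is a small sketch of BM25, one well-known probabilistic weighting formula, applied to a handful of pages. It is only meant to show what "rank results by relevance" means in practice; the function name, the whitespace tokenizer and the parameter defaults (k1=1.2, b=0.75) are my own choices for the example, and I'm not claiming this matches what any particular engine does internally.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def bm25_rank(docs: Dict[str, str], query: str,
              k1: float = 1.2, b: float = 0.75) -> List[Tuple[str, float]]:
    """Rank pages for a query with BM25; pages without a match are dropped."""
    tokenized = {name: text.lower().split() for name, text in docs.items()}
    n_docs = len(tokenized)
    if n_docs == 0:
        return []
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs

    # Document frequency: in how many pages does each word occur?
    df: Counter = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    scores: Dict[str, float] = {}
    for name, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue
            # Rare words count more than common ones.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates and is normalized by page length.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        if score > 0:
            scores[name] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

A page that mentions the search word often in a short text ends up above a long page that mentions it once, which a plain "does the page match" search cannot express.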

Database-based solutions
------------------------

These require entering the full texts in a database backend. This would 
require making the data storage backend fully pluggable - something that 
PmWiki was designed for anyway, but still a lot of work. (Besides, the 
backend doesn't seem to support all operations that are needed - the 
Rename recipe, for example, circumvents the "official" interface because 
that interface doesn't have a way to rename pages.)
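
As a sketch of what a fully pluggable backend with a rename operation might look like: the interface below has the four basic operations plus a default rename built from them, which a database backend could override with something atomic. All names here are hypothetical and Python is used only for brevity; this is not PmWiki's actual storage interface.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class PageBackend(ABC):
    """Hypothetical storage interface; every backend implements these."""

    @abstractmethod
    def read(self, name: str) -> str: ...

    @abstractmethod
    def write(self, name: str, text: str) -> None: ...

    @abstractmethod
    def delete(self, name: str) -> None: ...

    @abstractmethod
    def list_pages(self) -> List[str]: ...

    def rename(self, old: str, new: str) -> None:
        """Default rename from the primitives above (not atomic);
        a backend may override it, e.g. with a single SQL UPDATE."""
        if new in self.list_pages():
            raise ValueError("target page already exists: " + new)
        self.write(new, self.read(old))
        self.delete(old)


class MemoryBackend(PageBackend):
    """Trivial in-memory backend, just to exercise the interface."""

    def __init__(self) -> None:
        self._pages: Dict[str, str] = {}

    def read(self, name: str) -> str:
        return self._pages[name]          # KeyError if the page is missing

    def write(self, name: str, text: str) -> None:
        self._pages[name] = text

    def delete(self, name: str) -> None:
        del self._pages[name]

    def list_pages(self) -> List[str]:
        return sorted(self._pages)
```

With rename in the official interface, a recipe would no longer have to go around it and touch the page files directly.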

That said, this is all about future changes anyway (i.e. version 2.2 or 
something *g*), so this isn't a serious problem.

More problematic is that databases are more difficult to back up 
incrementally, and more difficult to manipulate using shell scripts.

PostgreSQL with tsearch2 module

PostgreSQL isn't available on all web hosters, and even where it is, the 
module may not be installed.
No experience reports available at this time.

MySQL on MyISAM tables

Not sure that this is a viable option. MySQL's full-text indexes only 
work on MyISAM tables at this time, and MyISAM tables lack several 
features (transactions, foreign keys) that InnoDB tables have.
Again, no reports available here.

Regards,
Jo