[pmwiki-users] Faster searches
Joachim Durchholz
jo at durchholz.org
Thu Apr 14 03:56:04 CDT 2005
Patrick R. Michaud wrote:
> I did think about doing it the way you describe, where a markup-specific
> function calls something to get the set of matching pages and then
> formats the output, but at the time I decided it was too inefficient
> (and searches are already too slow anyway) and less flexible than the
> one I chose.
In response to the "searches are already too slow anyway" bit:
I think there's an upgrade path available. If a wiki becomes too large
for searching, chances are that it's going to be on a "professional"
webhoster with shell access and a PHP installation that allows calling
external programs.
For these, it would be great if PmWiki had a way to hook up external
full-text indexing services.
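To make the idea a bit more concrete, here is a minimal sketch of such a hook, written in Python purely for illustration (PmWiki itself is PHP). Everything in it is an assumption: the function names are made up, and the external indexer is assumed to take the query as its last argument and print one page name per line. If the external program is missing or fails, the sketch falls back to the slow linear scan.

```python
import subprocess


def scan_search(pages, query):
    """Built-in fallback: case-insensitive linear scan over every page text."""
    needle = query.lower()
    return sorted(name for name, text in pages.items() if needle in text.lower())


def external_search(command, query):
    """Call a hypothetical external indexer.

    Assumption: the program accepts the query as its last argument and
    prints one matching page name per line on stdout.
    """
    result = subprocess.run(
        command + [query], capture_output=True, text=True, check=True
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


def search(pages, query, command=None):
    """Use the external indexer when configured, else scan the pages."""
    if command is None:
        return scan_search(pages, query)
    try:
        return external_search(command, query)
    except (OSError, subprocess.CalledProcessError):
        # Indexer not installed or it failed: degrade to the built-in scan.
        return scan_search(pages, query)
```

A site without shell access would simply leave `command` unset and keep the current behaviour.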
Here is some market research on open source indexing libs/services
(+ = advantage, - = drawback, o = neutral):
htdig (http://www.htdig.org/)
o Needs HTTP access (can use HTTP basic auth to access pages)
o Meta data can be used for result ranking
- Only ISO-Latin-1 character set supported.
- Incremental indexing is designed for batched nightly runs.
  (Incremental indexing that runs after every page save would be better.)
- Insists on presenting its own HTML pages.
mifluz (http://www.gnu.org/software/mifluz/)
The library within htdig.
- Needs 50% of raw text size for indexes (30% is typical)
- Alpha status
+ Can run incrementally (no periodic indexing runs)
  (that possibly contradicts the htdig properties...)
- Raw library, needs wrapper.
Lucene (http://lucene.apache.org/)
o Java-based.
+ Can index incrementally
- Raw library, needs wrapper.
+ Can index "document fields" (i.e. meta data).
Egothor (http://www.egothor.org/)
o Java-based.
+ Available as file-configured stand-alone application.
+ Word stemmers for 11 European languages. (No mean feat, that...)
  Building your own stemmer for a multi-language installation is
  possible, too (stemming quality degrades, but that's unavoidable,
  I think).
- No incremental indexing
Swish-e (http://swish-e.org/)
+ Stand-alone application, configured by file or command line.
o Can index incrementally (in a limited fashion).
- No meta-data indexing. (Properties can be stored along with text,
  but properties aren't indexed, or so I understood.)
- Limited character set support (no UTF-8 at this time)
Xapian
o Written in C++.
+ Accessible from PHP module.
+ Probabilistic method - can rank results by relevance and do
other "soft processing".
+ Stemmers for 12 European languages available
o (unsure about stemming on multi-language sites)
+ Incremental indexing.
- No meta-data indexing.
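Since "incremental indexing" keeps coming up as a criterion above, here is a toy sketch of what I mean by it: updating the index for just the one page that was saved, instead of re-crawling everything in a nightly batch. This is an illustrative in-memory inverted index in Python, not how any of the listed engines actually work internally. The class and method names are made up, and there is no stemming or ranking.

```python
import re
from collections import defaultdict


class IncrementalIndex:
    """Toy inverted index that is updated one page at a time."""

    def __init__(self):
        self.postings = defaultdict(set)  # word -> names of pages containing it
        self.page_words = {}              # page name -> words currently indexed

    @staticmethod
    def tokenize(text):
        """Split text into a set of lowercase words (no stemming)."""
        return set(re.findall(r"\w+", text.lower()))

    def update(self, name, text):
        """Call after every page save: only the changed words are touched."""
        new_words = self.tokenize(text)
        old_words = self.page_words.get(name, set())
        for word in old_words - new_words:
            self.postings[word].discard(name)
            if not self.postings[word]:
                del self.postings[word]
        for word in new_words - old_words:
            self.postings[word].add(name)
        self.page_words[name] = new_words

    def remove(self, name):
        """Call when a page is deleted."""
        self.update(name, "")
        self.page_words.pop(name, None)

    def search(self, query):
        """Return the sorted names of pages containing all query words."""
        words = self.tokenize(query)
        if not words:
            return []
        hits = [self.postings.get(word, set()) for word in words]
        return sorted(set.intersection(*hits))
```

The cost of a save is then proportional to the size of the saved page, not to the size of the wiki.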
Database-based solutions
------------------------
These require entering the full texts in a database backend. This would
require making the data storage backend fully pluggable - something that
PmWiki was designed for anyway, but still a lot of work. (Besides, the
backend doesn't seem to support all operations that are needed - the
Rename recipe, for example, circumvents the "official" interface because
that interface doesn't have a way to rename pages.)
That said, this is all about future changes anyway (i.e. version 2.2 or
something *g*), so this isn't a serious problem.
More problematic is that databases are more difficult to back up
incrementally, and more difficult to manipulate using shell scripts.
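To illustrate the point about a fully pluggable storage backend with a rename operation, here is a rough sketch in Python (illustration only, not PmWiki's actual PHP interface). The `WikiStore` and `MemoryStore` names and all method signatures are hypothetical. The idea is that rename is part of the official interface, with a default built from the primitives, so that a database backend can override it with a single statement.

```python
from abc import ABC, abstractmethod


class WikiStore(ABC):
    """Hypothetical storage interface; every backend implements these."""

    @abstractmethod
    def read(self, name):
        """Return the text of a page, or raise KeyError if it is missing."""

    @abstractmethod
    def write(self, name, text):
        """Create or overwrite a page."""

    @abstractmethod
    def delete(self, name):
        """Remove a page."""

    @abstractmethod
    def list_pages(self):
        """Return the names of all pages."""

    def rename(self, old, new):
        """Default rename built from the primitives above.

        A database backend could override this with one UPDATE instead.
        """
        if new in self.list_pages():
            raise ValueError("target page already exists: " + new)
        self.write(new, self.read(old))
        self.delete(old)


class MemoryStore(WikiStore):
    """Trivial in-memory backend, standing in for flat files or a database."""

    def __init__(self):
        self._pages = {}

    def read(self, name):
        return self._pages[name]

    def write(self, name, text):
        self._pages[name] = text

    def delete(self, name):
        del self._pages[name]

    def list_pages(self):
        return sorted(self._pages)
```

With something like this in place, a recipe such as Rename would not need to go around the interface.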
PostgreSQL with tsearch2 module
PostgreSQL isn't available on all web hosters, and even then the module
may not be installed.
No experience reports available at this time.
MySQL on MyISAM tables
Not sure that this is a viable option. MyISAM tables lack several
features that the (more common) InnoDB tables have.
Again, no reports available here.
Regards,
Jo