[pmwiki-users] Defaulting PmWiki to utf8

Patrick R. Michaud pmichaud at pobox.com
Wed Nov 14 20:51:14 CST 2007


On Wed, Nov 14, 2007 at 07:29:12PM -0500, sti at pooq.com wrote:
> Patrick R. Michaud wrote:
> > The big problem is that any existing pages of an iso-8859-1
> > site will have been saved using an iso-8859-1 encoding, using
> > iso-8859-1 encoded filenames.  Thus, it's not just a simple
> > matter of changing a configuration option -- we also have to
> > convert the various page files as well.
> 
> Right now I'm working on a site with many French page names. I looked into
> using utf-8 when I started a few months ago but ran into some problems, and
> ended up changing back.
> 
> Most of my problems had to do with the fact that my hosting site didn't have
> good utf-8 support for its shell-based tools, but I didn't like what happened
> to my URLs either.

This is very good information to know.

> Now, French computer users are used to seeing URLs with the accents dropped, so
> 
>   http://www.example.com/Lang/Francais
> 
> would be considered an acceptable URL, although not as acceptable as a page
> name. I've been thinking that for proper utf-8 support, one might want to be
> able to supply a Name->URL mapping function as part of the configuration. In
> the case of French, it would just replace accented characters with their
> non-accented counterparts.

This sounds a lot like the string folding function that PmWiki is
already using to perform case-and-accent-insensitive searches for
utf-8.  So perhaps we could modify the tables slightly to be able
to serve both purposes.
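
A minimal sketch of that kind of accent folding, written in Python
rather than PmWiki's PHP.  The function name fold_name is hypothetical,
and this uses Unicode decomposition instead of PmWiki's own folding
tables, so treat it as an illustration of the idea only:

```python
import unicodedata

def fold_name(name: str) -> str:
    """Hypothetical Name->URL mapping: drop accents, keep base letters."""
    # NFD splits an accented letter into its base letter followed by
    # one or more combining marks (e.g. c + combining cedilla).
    decomposed = unicodedata.normalize("NFD", name)
    # Keep everything except the combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(fold_name("Fran\u00e7ais"))
```

Characters with no decomposition (ligatures such as "oe", for example)
pass through unchanged, which is one reason a hand-built table may
still be needed.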

> Then, when looking for a page on disk, PmWiki would first look 
> under the name as given, and secondly under the mapped name. 
> Since the mapping might be irreversible (as is the case in my 
> French example) one would want to store the canonical name as 
> an attribute in the page, for use in things like {$FullName}.

PmWiki already stores the canonical name in a page file, as the name=
parameter.  A big downside to the mapped-name approach is that PmWiki
can no longer use a simple directory listing to retrieve the set of
pagenames on a site -- it must actually read each file to determine
the true name.  That slows down a number of operations.

Perhaps this could be reduced somewhat by placing a special
extension or marker in the names of pagefiles when the name
is a mapped form of the canonical one.  Then, when building
a list of files PmWiki only has to read the files having the
marker, instead of reading all of them.
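
Roughly, the listing could then work like the sketch below (Python,
for illustration).  The ",mapped" suffix, the function names, and the
assumption that the page file holds one key=value attribute per line
are all hypothetical here, not existing PmWiki behavior:

```python
import os

MARKER = ",mapped"  # hypothetical suffix marking a mapped file name

def read_canonical_name(path):
    """Return the value of the name= line in a page file, or None."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.startswith("name="):
                return line[len("name="):].rstrip("\n")
    return None

def list_pagenames(wiki_dir):
    """List page names, opening only the files that carry the marker."""
    names = []
    for filename in sorted(os.listdir(wiki_dir)):
        if filename.endswith(MARKER):
            # File name is a mapped form: the true name is inside the file.
            path = os.path.join(wiki_dir, filename)
            names.append(read_canonical_name(path) or filename[:-len(MARKER)])
        else:
            # File name already is the canonical page name: no read needed.
            names.append(filename)
    return names
```

On a site where few page names need mapping, almost every entry takes
the cheap branch.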

Side note:  There are times when I really _regret_ that the DBM
functions are deprecated in PHP -- they could be really useful
for optimizing performance and indexing situations like these.
Yes, PHP offers a 'dba' module as an alternative, but many
PHP installations do not include it in the build, so we can't
rely on it being present.

> I noticed a while back that the encoding of a page is stored internally. 
> [...]

This is true only for pages created in 2.2.0-beta43 or later.
Pages created in earlier versions of PmWiki -- including 2.1.x --
do not have the charset= attribute, so we can't rely on it
being available for conversion.  We _can_ figure out what version 
of PmWiki was used to write such pages... but that still doesn't
really tell us what charset was in use at the time, or whether
any sort of conversion needs to be performed.  :-|
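
One possible fallback, sketched in Python: trust charset= when it is
there, and otherwise guess from the bytes.  The guessing step is a
common heuristic and an assumption of this sketch, not something
PmWiki does; guess_charset is a hypothetical name, and the one
attribute per line layout is assumed as well:

```python
def guess_charset(raw: bytes) -> str:
    """Guess the encoding of a page file's raw bytes."""
    # Prefer an explicit charset= attribute if the page has one.
    for line in raw.split(b"\n"):
        if line.startswith(b"charset="):
            return line[len(b"charset="):].decode("ascii", "replace").strip()
    # No attribute: bytes that decode cleanly as UTF-8 are very likely
    # UTF-8 (pure ASCII also lands here and needs no conversion).
    try:
        raw.decode("utf-8")
        return "UTF-8"
    except UnicodeDecodeError:
        # Assumed default for older sites; other 8-bit charsets would
        # be misreported as ISO-8859-1.
        return "ISO-8859-1"
```

This cannot tell iso-8859-1 apart from other single-byte charsets, so
a site-wide default would still have to come from the administrator.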

Pm


