[pmwiki-users] Defaulting PmWiki to utf8

Patrick R. Michaud pmichaud at pobox.com
Thu Nov 15 08:40:19 CST 2007


On Thu, Nov 15, 2007 at 12:35:27AM -0500, sti at pooq.com wrote:
> >> Since the mapping might be irreversible (as is the case in my 
> >> French example) one would want to store the canonical name as 
> >> an attribute in the page, for use in things like {$FullName}.
> > 
> > PmWiki already stores the canonical name in a page file, as the name=
> > parameter.  A big downside to this approach is that PmWiki cannot
> > then use a simple directory listing to retrieve the set of pagenames
> > on a site -- it must actually read the file to determine the
> > true name.  That slows down a number of operations.
> > 
> > Perhaps this could be reduced somewhat by placing a special
> > extension or marker in the names of pagefiles when the name
> > is a mapped form of the canonical one.  Then, when building
> > a list of files PmWiki only has to read the files having the
> > marker, instead of reading all of them.
> 
> [...] I can also imagine a .pageindex-like scheme whereby an index file
> could hold the canonical names of pages that have their filenames munged. If
> an entry isn't in the index, the file in question is opened and the canonical
> name is added. 

I fear a couple of issues with a .pageindex-like scheme -- primarily,
there's the question of when/how to access it.  Scanning an index file
on every PageRead() is likely to be too slow.  If we try to read the
index once and cache all of the mappings in memory, we may end up
with memory issues for sites that are running with PHP memory limits
set.

But, scanning an index file only when needing to generate a list
of all pages might work out very nicely.  I'll have to think about
that a bit.  In general, working with indexes is a pain -- though not
having to maintain them ourselves would be one advantage of using a
database storage mechanism, of course.
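
To make the "only when listing pages" idea concrete, here is a rough
sketch in Python (not PmWiki code).  It assumes a hypothetical ".mapped"
suffix on munged filenames and keeps the index as a plain in-memory dict;
where the index is stored on disk, and the names list_pagenames() and
read_canonical_name(), are illustrative only.  A file is opened only
when its marked filename is missing from the index.

```python
import os

MARKER = ".mapped"  # hypothetical suffix flagging a munged filename


def read_canonical_name(path):
    """Return the value of the name= line in a page file, or None."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if line.startswith("name="):
                return line[len("name="):].rstrip("\n")
    return None


def list_pagenames(wikidir, index):
    """List canonical page names, opening files only on an index miss."""
    names = []
    for fname in sorted(os.listdir(wikidir)):
        if not fname.endswith(MARKER):
            # Unmarked file: the filename already is the page name.
            names.append(fname)
            continue
        if fname not in index:
            # Index miss: read the file once and remember the result.
            index[fname] = read_canonical_name(os.path.join(wikidir, fname))
        names.append(index[fname])
    return names
```

Single-page reads never touch the index in this sketch; only the
listing operation does, and each marked file is read at most once.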

Also, if we're using an index sort of approach, it might make sense to
get the existing .pageindex file to serve double-duty here, and move
indexing into the page-handling functions instead of in the pagelist.php
script.  But this is all sounding more and more like the post-2.2.0
redesigns identified in the RoadMap.

So, perhaps the correct baby step is to switch PmWiki to using utf8
by default via its present mechanisms (i.e., without name mappings),
and then add name mapping features as a post-2.2.0 improvement.
Folks who prefer the somewhat nicer encodings for pagenames (i.e.,
%e7 instead of %c3%a7) will still have the option of selecting
iso-8859-1 for their systems.

It's also worth noting that Wikipedia seems to follow the %c3%a7 
convention, and this is also the convention recommended by RFC 3986.
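
For anyone who wants to see the two encodings side by side, this short
Python snippet (an illustration only, not PmWiki code) percent-encodes
the same name under each charset.  The page name "Français" is just an
example; note that Python emits uppercase hex digits, which are
equivalent to the lowercase forms written above.

```python
from urllib.parse import quote, unquote

name = "Français"

# One byte per accented character under iso-8859-1 ...
latin1_url = quote(name, encoding="iso-8859-1")
# ... and two bytes for the same character under utf-8.
utf8_url = quote(name, encoding="utf-8")

print(latin1_url)  # Fran%E7ais
print(utf8_url)    # Fran%C3%A7ais

# Decoding only round-trips when the same charset is assumed.
print(unquote(utf8_url, encoding="utf-8"))         # Français
print(unquote(latin1_url, encoding="iso-8859-1"))  # Français
```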

> Of course, in cases like Chinese where EVERY name is munged,
> that file may grow very big, very fast.

Yes, but for the moment I'm principally concerned only with mapping
of iso-8859-1 names.  People who are using PmWiki in Chinese are
already using utf-8 and I don't feel as pressing a need to solve
url mapping issues there yet.  Nor am I familiar enough with Chinese to
know the character mappings... but if someone can provide them we can
give it a try.

> >> I noticed a while back that the encoding of a page is stored internally. 
> >> [...]
> > 
> > This is true only for pages created in 2.2.0-beta43 or later.
> > Pages created in earlier versions of PmWiki -- including 2.1.x --
> > do not have the charset= attribute, so we can't rely on it
> > being available for conversion.  [...]
> 
> Well, as the current default is iso-8859-1, I guess one could set some sort of
> $LegacyCoding variable to that value, and use it when there is no internal
> indication. Folks who are using an old version with some other encoding will
> have to set it manually when upgrading, which I admit might be error prone.

Folks who are using some other encoding are already including
an xlpage-xxx.php file, and so their configurations already tell us
what encoding they have been using.  Furthermore, in such cases
we don't need to do any conversion to utf8 -- the site can continue 
using its selected encoding without any difficulty.  It does become
a bit more of an issue if such a site wants to _also_ convert to utf8...
but that's outside the scope of what I'm trying to address for now --
i.e., change PmWiki's default to utf8 without breaking sites that were
built with the previous default.
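
As a rough sketch of the fallback being discussed (in Python, not
PmWiki code): take the page's own charset= attribute when it exists,
otherwise whatever encoding the site configured, otherwise the old
iso-8859-1 default.  LEGACY_CHARSET, source_charset() and to_utf8()
are hypothetical names for illustration, and the attrs dict stands in
for the attributes read from a page file.

```python
LEGACY_CHARSET = "iso-8859-1"  # assumed previous default, per the discussion


def source_charset(attrs, site_charset=None):
    """Pick the charset a page was most likely written in."""
    # Pages from 2.2.0-beta43 or later carry their own charset= attribute.
    if attrs.get("charset"):
        return attrs["charset"]
    # Otherwise trust the site's configured encoding, if there is one.
    if site_charset:
        return site_charset
    # Older pages on an unconfigured site used the previous default.
    return LEGACY_CHARSET


def to_utf8(raw, attrs, site_charset=None):
    """Re-encode raw page bytes as utf-8, using the guessed source charset."""
    text = raw.decode(source_charset(attrs, site_charset))
    return text.encode("utf-8")
```

A site that wants to keep its existing encoding would simply never
call the conversion step.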

Thanks again, this thread is proving extremely useful to me.

Pm


