[pmwiki-users] Defaulting PmWiki to utf8

Wed Nov 14 23:35:27 CST 2007

Patrick R. Michaud wrote:
> On Wed, Nov 14, 2007 at 07:29:12PM -0500, sti at pooq.com wrote:
>> Patrick R. Michaud wrote:

>> Now, French computer users are used to seeing URLs with the accents 
>> dropped, so
>>
>>   http://www.example.com/Lang/Francais
>>
>> would be considered an acceptable URL, although not as acceptable as a page
>> name. I've been thinking that for proper utf-8 support, one might want to be
>> able to supply a Name->URL mapping function as part of the configuration. In
>> the case of French, it would just replace accented characters with their
>> non-accented counterparts.
> 
> This sounds a lot like the string folding function that PmWiki is
> already using to perform case-and-accent-insensitive searches for
> utf-8.  So perhaps we could modify the tables slightly to be able
> to serve both purposes.

Yes, I think so, at least in the case of Latin languages.
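
To make the idea concrete, here is a rough sketch of such a Name->URL mapping
for Latin scripts. It is written in Python purely for illustration and is not
PmWiki code; fold_name is a made-up name, and it uses Unicode decomposition
as a stand-in for PmWiki's own folding tables, which may well behave
differently.

```python
import unicodedata

def fold_name(name: str) -> str:
    """Drop accents from a page name (hypothetical helper, not a PmWiki function)."""
    # NFD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", name)
    # Keep everything except the combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(fold_name("Français"))
```

Distinct names can fold to the same string, so the mapping cannot in general
be reversed. Scripts without a base-letter decomposition, such as Chinese,
pass through unchanged and would need a different scheme.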

>> Then, when looking for a page on disk, PmWiki would first look 
>> under the name as given, and secondly under the mapped name. 
>> Since the mapping might be irreversible (as is the case in my 
>> French example) one would want to store the canonical name as 
>> an attribute in the page, for use in things like {$FullName}.
> 
> PmWiki already stores the canonical name in a page file, as the name=
> parameter.  A big downside to this approach is that PmWiki cannot
> then use a simple directory listing to retrieve the set of pagenames
> on a site -- it must actually read the file to determine the
> true name.  That slows down a number of operations.
> 
> Perhaps this could be reduced somewhat by placing a special
> extension or marker in the names of pagefiles when the name
> is a mapped form of the canonical one.  Then, when building
> a list of files PmWiki only has to read the files having the
> marker, instead of reading all of them.

That might work. One might also have a flag controlling whether file names are
stored as-is. For those who don't often use shell access, this might be
acceptable. I can also imagine a .pageindex-like scheme whereby an index file
holds the canonical names of pages whose filenames have been munged. If
an entry isn't in the index, the file in question is opened and its canonical
name is added. Of course, in cases like Chinese, where EVERY name is munged,
that file may grow very big, very fast.
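
A sketch of how the marker and the index could work together, again in Python
and only as an illustration. The ",mapped" suffix, both function names, and
the in-memory dict standing in for the index file are all assumptions; the
only thing taken from the discussion is that a page file carries its
canonical name in a name= line.

```python
import os

MARKER = ",mapped"  # hypothetical suffix flagging a munged filename

def read_canonical_name(path):
    """Return the value of the first 'name=' line in a page file, or None."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.startswith("name="):
                return line[len("name="):].rstrip("\n")
    return None

def list_pagenames(wikidir, index):
    """List canonical page names, opening only marked files not yet indexed."""
    names = []
    for fname in sorted(os.listdir(wikidir)):
        if not fname.endswith(MARKER):
            names.append(fname)  # unmarked: the filename is the page name
            continue
        if fname not in index:   # index miss: read the file once and remember it
            found = read_canonical_name(os.path.join(wikidir, fname))
            index[fname] = found or fname[:-len(MARKER)]
        names.append(index[fname])
    return names
```

Unmarked files cost only the directory listing, and each marked file is read
at most once per index. A real version would have to save the index to disk
and drop entries when pages are renamed or deleted.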

>> I noticed a while back that the encoding of a page is stored internally. 
>> [...]
> 
> This is true only for pages created in 2.2.0-beta43 or later.
> Pages created in earlier versions of PmWiki -- including 2.1.x --
> do not have the charset= attribute, so we can't rely on it
> being available for conversion.  We _can_ figure out what version 
> of PmWiki was used to write such pages... but that still doesn't
> really tell us what charset was in use at the time, or whether
> any sort of conversion needs to be performed.  :-|

Well, as the current default is iso-8859-1, I guess one could set some sort of
$LegacyCoding variable to that value, and use it when the page carries no
charset= attribute. Folks who are using an old version with some other encoding
will have to set it manually when upgrading, which I admit might be error-prone.
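
Roughly, the fallback could look like the following Python sketch. It is not
PmWiki code: LEGACY_CODING stands in for the proposed $LegacyCoding variable,
decode_page is a made-up name, and the sketch assumes the charset= attribute
sits at the start of a line in the page file.

```python
LEGACY_CODING = "iso-8859-1"  # stand-in for the proposed $LegacyCoding setting

def decode_page(raw: bytes, legacy: str = LEGACY_CODING) -> str:
    """Decode a page file using its charset= attribute, else the legacy coding."""
    charset = legacy
    # The attribute line itself is plain ASCII, so it can be found in raw bytes.
    for line in raw.split(b"\n"):
        if line.startswith(b"charset="):
            charset = line[len(b"charset="):].decode("ascii").strip()
            break
    return raw.decode(charset)
```

A site that ran an old version with some other encoding would pass that
encoding as legacy instead. Nothing here can detect a wrong setting, since
almost any byte sequence decodes without error as iso-8859-1.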