[pmwiki-users] Defaulting PmWiki to utf8

Eemeli Aro eemeli at gmail.com
Sun Dec 2 10:31:45 CST 2007


On Nov 15, 2007 2:29 AM,  <sti at pooq.com> wrote:
> Now, French computer users are used to seeing URLs with the accents dropped, so
>
>   http://www.example.com/Lang/Francais
>
> would be considered an acceptable URL, although not as acceptable as a page
> name. I've been thinking that for proper utf-8 support, one might want to be
> able to supply a Name->URL mapping function as part of the configuration. In
> the case of French, it would just replace accented characters with their
> non-accented counterparts.

I'm getting into this discussion a bit late, but I've implemented
something like this on my Finnish (ISO 8859-1) sites with the
following in my config.php:

$PageNameChars = '-[:alnum:]';
$MakePageNamePatterns = array(
    "/[\xE0-\xE6]/" => 'a', "/[\xC0-\xC6]/" => 'A',     # umlauts and accents
    "/[\xE8-\xEB]/" => 'e', "/[\xC8-\xCB]/" => 'E',     # char codes from
    "/[\xEC-\xEF]/" => 'i', "/[\xCC-\xCF]/" => 'I',     # ISO 8859-1
    "/[\xF2-\xF8]/" => 'o', "/[\xD2-\xD8]/" => 'O',
    "/[\xF9-\xFC]/" => 'u', "/[\xD9-\xDC]/" => 'U',
    "/'/" => '',                            # strip single-quotes
    "/[^$PageNameChars]+/" => ' ',          # convert everything else to space
    '/((^|[^-\\w])\\w)/e' => "strtoupper('$1')",  # capitalise each word
    '/ /' => ''                                    # then remove the spaces
);

It's been a while since I wrote that, so I can't remember everything
it does behind the scenes, or whether there's a better way. The effect
is that all pages are stored under filenames in the plain ASCII range,
while referring to them with accents or umlauts still works, whether
in a [[link]] or in a URL. Page titles also default to the ASCII name
when unset, but I have a separate title field on the edit form to make
setting one a relatively easy matter.
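
As a rough illustration of what those patterns do to a name, here is a
stand-alone sketch. It is not PmWiki's own code: the function name
sketch_ascii_page_name is made up for this example, the input is
assumed to be ISO 8859-1 bytes, and preg_replace_callback stands in
for the /e modifier, which PHP 7 and later no longer support.

```php
<?php
// Hypothetical stand-alone sketch, not part of PmWiki.
// Assumes $name holds ISO 8859-1 bytes, as in the config above.
function sketch_ascii_page_name(string $name): string {
    $chars = '-[:alnum:]';   // same value as $PageNameChars above
    // Plain replacements, applied in array order.
    $patterns = array(
        "/[\xE0-\xE6]/" => 'a', "/[\xC0-\xC6]/" => 'A',
        "/[\xE8-\xEB]/" => 'e', "/[\xC8-\xCB]/" => 'E',
        "/[\xEC-\xEF]/" => 'i', "/[\xCC-\xCF]/" => 'I',
        "/[\xF2-\xF8]/" => 'o', "/[\xD2-\xD8]/" => 'O',
        "/[\xF9-\xFC]/" => 'u', "/[\xD9-\xDC]/" => 'U',
        "/'/" => '',                  // strip single-quotes
        "/[^$chars]+/" => ' ',        // everything else becomes a space
    );
    foreach ($patterns as $pattern => $replacement) {
        $name = preg_replace($pattern, $replacement, $name);
    }
    // Upper-case the first character of each word (replaces the /e rule).
    $name = preg_replace_callback(
        '/((^|[^-\\w])\\w)/',
        function ($m) { return strtoupper($m[1]); },
        $name
    );
    // Finally join the words together.
    return str_replace(' ', '', $name);
}

// "pääsiäinen sivu" written as ISO 8859-1 bytes
echo sketch_ascii_page_name("p\xE4\xE4si\xE4inen sivu"), "\n";
```

With the input above this should print PaasiainenSivu. Bytes outside
the listed ranges are not transliterated: a character such as ç (\xE7)
would fall under the "everything else" rule and act as a word break,
so a French site would need a few more entries.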

It's relatively common practice in Finnish to drop the umlauts (ä, ö)
when having to work in a charset that doesn't support them. The
Finnish language also helps a bit by having vowel harmony, which means
that conflicts are rare and in most cases you can tell whether it's an
'a' or an 'ä'.

In other words, switching to UTF-8 should be relatively painless for
me, but also relatively unimportant. Whatever change happens, I'd very
much prefer a tool to convert everything at once rather than leave the
pages to sit and only convert them when they're edited for the first
time in the new system.

eemeli


