[pmwiki-users] UTF-8 as core default encoding (was: Headers are not sending charset !)

Patrick R. Michaud pmichaud at pobox.com
Mon Mar 12 15:03:57 CDT 2007


On Mon, Mar 12, 2007 at 08:46:32PM +0100, Petko Yotov wrote:
> On Monday 12 March 2007 16:53, Patrick R. Michaud wrote:
> > preg_match supports the /u modifier, but the /u modifier doesn't
> > cause either /i or [[:upper:]]/[[:lower:]] in patterns to work.
> > All that the /u modifier does is cause PCRE to recognize multibyte utf-8
> > sequences as being single characters (and that doesn't seem to
> > matter much for the patterns that PmWiki uses).
> 
> Actually, from PHP 4.4.0 on, there is a \p{Ll} and \p{Lu} for lower and upper 
> case letters[1]. 
> ...

Wow, this is very good news.  This will be helpful.
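
For illustration, here is a minimal sketch of the difference between /u on its own and /u combined with the property classes. The helper name and the sample strings are hypothetical (nothing here is taken from PmWiki), and it assumes a PCRE build with UTF-8 and Unicode property support and a source file saved as UTF-8:

```php
<?php
// Hypothetical helper, not part of PmWiki: true if $s begins with an
// uppercase letter followed by a lowercase letter, in any script.
// Assumes PCRE was built with UTF-8 and Unicode property support.
function starts_upper_lower($s) {
  return preg_match('/^\p{Lu}\p{Ll}/u', $s) === 1;
}

// "é" written out as its two UTF-8 bytes.
$e = "\xC3\xA9";

// Without /u, "." matches a single byte, so the two-byte sequence is
// not seen as one character; with /u it is.
$oneCharWithoutU = preg_match('/^.$/', $e);
$oneCharWithU    = preg_match('/^.$/u', $e);

var_dump($oneCharWithoutU, $oneCharWithU);
var_dump(starts_upper_lower('Wiki'));   // Latin, leading uppercase
var_dump(starts_upper_lower('Вики'));   // Cyrillic, leading uppercase
var_dump(starts_upper_lower('wiki'));   // no leading uppercase
```

The point is only that \p{Lu} and \p{Ll} classify letters by Unicode property rather than by locale, so the same pattern covers Latin and Cyrillic alike.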

> So, in the next few hours I'll make a major rewrite of xlpage-utf-8.php in 
> order to:
> 
> * move $CaseConversions in another script
> ** load it only when there is no mb_strtoupper, or phpversion < 4.4
> * use the new features when possible (/u, etc.)

I suggest holding off on the rewrite of xlpage-utf-8.php, if only 
because I'm expecting to deal with case-insensitive utf-8 searches
later tonight, and that will undoubtedly cause me to make
a number of important changes to xlpage-utf-8.php at the same time.
(It's okay to work on it if you really want... I just didn't
want you to spend a lot of time working on something I plan to
address today+tomorrow anyway.)

> It may also be possible to actually limit the $PageNameChars to only letters 
> and numbers, but I have little knowledge of other different alphabets other 
> than Latin and Cyrillic.

We probably don't want to do this, as I think it would negatively
impact CJK and other sites with pagenames composed of ideographs
or graphemes that don't fall into the letters/numbers category.
Either that or we'll need to be able to precisely enumerate the
properties that make sense for the CJK languages.
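
As a rough sketch of what "enumerating the properties" could look like: the two patterns below are hypothetical, not PmWiki's actual $PageNameChars, and which set of properties is really appropriate is exactly the open question. The sample is a Devanagari word that contains combining marks, and the code assumes PCRE with Unicode property support and a UTF-8 source file:

```php
<?php
// Hypothetical patterns, not PmWiki's actual $PageNameChars.
$lettersAndNumbers = '/^[\p{L}\p{N}]+$/u';
$withMarks         = '/^[\p{L}\p{N}\p{M}]+$/u';  // also allow combining marks

// Devanagari sample containing combining vowel signs and a virama,
// which belong to the mark categories rather than to letters.
$name = 'हिन्दी';

$strict  = preg_match($lettersAndNumbers, $name);  // letters/numbers only
$relaxed = preg_match($withMarks, $name);          // letters/numbers/marks

var_dump($strict, $relaxed);
```

With letters and numbers alone the name is rejected; it is accepted once \p{M} is listed as well. Other scripts may need further properties, so any such list would have to be checked against real page names.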

Pm


