[pmwiki-users] 2.2.0-beta43 released (drafts, expressions, diff, utf-8)

Patrick R. Michaud pmichaud at pobox.com
Mon Apr 16 07:36:54 CDT 2007


On Mon, Apr 16, 2007 at 10:04:55AM +0200, Petko Yotov wrote:
> On Sunday 15 April 2007 23:16, Patrick R. Michaud wrote:
> > Searches on sites using utf-8 are now performed case-insensitively
> > for accented characters.
> 
> Hello Patrick.
> 
> There is a problem with the $CaseConversions array, line 214:
> 
>    "\xc9\xbd" => "\x171\xa4",

You're correct.  It's now fixed for the next release.

> I tried to find what is written in the source [1] but could not find the 
> sequence C9BD.

"\xc9\xbd" is a UTF-8 byte sequence, while the UnicodeData.txt file
lists codepoint values (not UTF-8 encodings).  So, \xc9\xbd is the
UTF-8 encoding of U+027D, which is the entry to look for there.
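
As an illustration of the mapping described above, the following minimal
Python sketch recovers the codepoint from the two UTF-8 bytes, once with
the built-in decoder and once by hand.  The variable names are chosen
here for illustration and are not taken from PmWiki.

```python
# The two bytes as they appear in the $CaseConversions key.
raw = b"\xc9\xbd"

# Let the standard decoder do the work.
codepoint = ord(raw.decode("utf-8"))

# Same result by hand: a 2-byte sequence has the layout
# 110xxxxx 10yyyyyy, and the codepoint is the bits xxxxxyyyyyy.
manual = ((raw[0] & 0x1F) << 6) | (raw[1] & 0x3F)

print(f"U+{codepoint:04X}")  # U+027D
print(f"U+{manual:04X}")     # U+027D
```

The value U+027D is what appears in the first column of UnicodeData.txt,
which is why a search for "C9BD" finds nothing.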

The uppercase conversion of U+027D is U+2C64.  The codepoint U+2C64
requires a 3-byte UTF-8 encoding, but the program I wrote to translate
codepoints to UTF-8 was set up to handle only 1-byte and 2-byte
encodings, so it mis-encoded the sequence as \x171\xa4.

So, the uppercase conversion of U+027D (\xc9\xbd in UTF-8) is 
U+2C64 (\xe2\xb1\xa4 in UTF-8).
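
The encoding step can be sketched in Python as follows.  This is not the
translation program mentioned above (which is not shown in this thread);
utf8_encode and two_byte_only_lead are hypothetical names used only for
illustration.  The second function shows one plausible way a value like
\x171 can arise: applying the 2-byte layout to a codepoint that needs
3 bytes, with the lead byte formed by addition.  That is an assumption
about the cause, not a confirmed account.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one codepoint as UTF-8 (surrogates are not checked here)."""
    if cp < 0x80:                      # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                     # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:                   # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),   # 4 bytes
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def two_byte_only_lead(cp: int) -> int:
    """Hypothetical: lead 'byte' if the 2-byte layout is forced on cp."""
    return 0xC0 + (cp >> 6)


print(utf8_encode(0x027D).hex())          # c9bd
print(utf8_encode(0x2C64).hex())          # e2b1a4
print(hex(two_byte_only_lead(0x2C64)))    # 0x171
```

The last value does not fit in a single byte, which is consistent with
the malformed "\x171\xa4" entry quoted at the top of this message.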

I had already caught the 3-byte instances elsewhere in the
tables, but apparently missed this one.  Thanks for catching it!

Pm


