[Pmwiki-users] accented characters

Sun Feb 1 09:50:13 CST 2004

Thank you Patrick!
This is another piece of information, albeit somewhat hairy, that I will
keep in my vault, for future use.
Although it may prove difficult to quote at a dinner party ;-)

You are on the verge of opening an exciting Pandora's box, IMHO, since you may
soon get requests to achieve the same spectacular result with eastern
alphabets, then ideograms!

Jean-Claude
----- Original Message -----
From: "Patrick R. Michaud" <pmichaud at pobox.com>
To: "Jean-Claude" <Wiki at jcgg.net>
Cc: <pmwiki-users at pmichaud.com>
Sent: Sunday, February 01, 2004 5:05 PM
Subject: Re: [Pmwiki-users] accented characters

> On Sat, Jan 31, 2004 at 11:09:14PM +0100, Jean-Claude wrote:
> > Congratulations, Patrick, you made my browser work!
> >
> > For my own education, would you share part of the "trick"?
>
> Sure, although it gets fairly technical, and some may need some
> background.
> If you want to skip the background and details, just jump to the
> "Solution" section at the bottom.
>
> The problem has to do with differences in the way that browsers
> encode non-ASCII characters in URIs.  RFC 2396 says that a string
> such as "ÔlhàCoïtAmbigü" should be encoded in a 7-bit compatible form
> as "%d4lh%e0Co%eftAmbig%fc", where the %d4, %e0, %ef, and %fc
> are the Ô, à, ï, and ü characters.  This is what Netscape and other
> browsers do.
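
A minimal sketch of that single-byte encoding, in Python rather than the PHP used later in the message. The page name is reconstructed from the percent-codes quoted above, and the variable names are illustrative only. Note that `urllib.parse.quote` emits upper-case hex digits, which are equivalent to the lower-case ones in the message.

```python
from urllib.parse import quote

# Page name reconstructed from the percent-codes in the message:
# Ô (U+00D4), à (U+00E0), ï (U+00EF), ü (U+00FC).
page_name = "\u00d4lh\u00e0Co\u00eftAmbig\u00fc"

# Percent-encode the ISO-8859-1 bytes: one byte, hence one %xx, per accented letter.
latin1_uri = quote(page_name, encoding="iso-8859-1")

print(latin1_uri)  # %D4lh%E0Co%EFtAmbig%FC
```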
>
> However, it didn't take long for people to recognize that
> "%d4lh%e0Co%eftAmbig%fc" is a really ugly URI and difficult to
> accurately transcribe from other sources (e.g., over the phone
> or printed in a book).  And, in some languages the characters used
> are actually multi-byte sequences.  So, RFC 2718 proposes to base
> URIs on UTF-8, which is a mechanism for encoding characters above
> the range of US-ASCII while preserving the meanings of ASCII characters.
> Thus, a browser using UTF-8 encodes "ÔlhàCoïtAmbigü" as
> "%c3%94lh%c3%a0Co%c3%aftAmbig%c3%bc" when it sends the URI to the server,
> with the Ô, à, ï, and ü characters encoded as %c3%94, %c3%a0, %c3%af,
> and %c3%bc, and the name appears like "Ã”lhÃ CoÃ¯tAmbigÃ¼" if each byte
> is displayed as a separate character.  This is what many versions of
> Internet Explorer do, and it is a W3C recommendation.
> (To compound the problem slightly, IE users can use "Internet Options"
> to change IE's UTF8 behavior. :-)
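
To make the UTF-8 case concrete, here is a small Python sketch (again not PmWiki code; the names are illustrative). It encodes the same reconstructed page name as UTF-8, then decodes the resulting bytes as ISO-8859-1 to show how the doubled characters arise.

```python
from urllib.parse import quote, unquote_to_bytes

# Same reconstructed page name: Ô, à, ï, ü are U+00D4, U+00E0, U+00EF, U+00FC.
page_name = "\u00d4lh\u00e0Co\u00eftAmbig\u00fc"

# A UTF-8 browser sends two bytes, hence two %xx groups, per accented letter.
utf8_uri = quote(page_name, encoding="utf-8")
print(utf8_uri)  # %C3%94lh%C3%A0Co%C3%AFtAmbig%C3%BC

# A server that assumes ISO-8859-1 turns each of those bytes into its own character.
raw_bytes = unquote_to_bytes(utf8_uri)
misread = raw_bytes.decode("iso-8859-1")
print(len(page_name), len(misread))  # 14 18
```

The four extra characters in `misread` are the lead bytes (0xC3), each of which ISO-8859-1 shows as "Ã".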
>
> Thus, when a server program receives a URI from a browser containing
> a sequence such as "%c3%a0" in it, how is it to know if this represents
> the UTF-8 character "à" versus the ISO-8859-1 two-character string
> of "Ã" (%c3) followed by a non-breaking space (%a0)?  It's just
> coincidence that all of the sequences in this example begin with
> %c3--UTF-8 encodings can begin with anything from %c2 to %f7, which
> includes most of the ISO-8859-1 letters.  (Note that here I'm using
> the term "letter" in an internationalized sense; Ô, à, ï, and ü are
> all "letters" but don't fall in the US-ASCII range A-Za-z.)
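
The ambiguity can be seen directly by decoding the same two bytes both ways. This is only an illustrative Python sketch; the variable names are not from the message.

```python
# The two bytes a server sees after unescaping "%c3%a0".
raw = bytes([0xC3, 0xA0])

# Read as UTF-8: one character, à (U+00E0).
as_utf8 = raw.decode("utf-8")

# Read as ISO-8859-1: two characters, Ã (U+00C3) and a non-breaking space (U+00A0).
as_latin1 = raw.decode("iso-8859-1")

print(len(as_utf8), len(as_latin1))  # 1 2
```

Both decodings succeed, so the bytes alone do not reveal which one the browser meant.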
>
> In the general case of receiving character strings from browsers,
> there's no reliable way of knowing what was intended, because
> character-encoding information isn't yet part of a request.  So,
> there's no generic solution to the problem yet.
>
> However, in our specific application, we know that we're looking for
> characters that are valid wiki page names--i.e., (internationalized)
> letters and digits.  It turns out that another characteristic of
> UTF-8 encodings is that the second and subsequent bytes of non-ASCII
> characters must all fall in the range %80 to %bf.  What's more, none
> of these bytes represent letters in the ISO-8859-1 character set.
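
The byte-range property is easy to check on a few samples. The sketch below is Python; `trailing_bytes` is a hypothetical helper and the sample characters are arbitrary. It prints the bytes that follow the lead byte of each character and confirms they all lie between 0x80 and 0xBF.

```python
def trailing_bytes(ch: str) -> bytes:
    """Return every UTF-8 byte of `ch` after the lead byte."""
    return ch.encode("utf-8")[1:]

def in_continuation_range(data: bytes) -> bool:
    """True if every byte lies in 0x80..0xBF."""
    return all(0x80 <= b <= 0xBF for b in data)

# Arbitrary samples of 2-, 3-, and 4-byte characters: à, ü, Ж, 中, and an emoji.
samples = "\u00e0\u00fc\u0416\u4e2d\U0001f600"

for ch in samples:
    tail = trailing_bytes(ch)
    print(f"U+{ord(ch):04X}", tail.hex(), in_continuation_range(tail))
```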
>
> ===Solution===
> Thus, after days of sometimes frustrating exploration and effort over
> the past six months, the solution for resolving internationalized wiki
> page names appears to be maddeningly simple:  All non-ASCII characters
> in UTF-8 contain at least one byte in the range %80 to %bf, and none
> of these bytes are letters in ISO-8859-1.  So, if a browser sends a
> page name containing a byte in the range %80 to %bf, then the name
> needs to be UTF-8 decoded.  In PHP I solved it with two lines of code:
>
>    if (preg_match('/[\\x80-\\xbf]/',$pagename))
>        $pagename=utf8_decode($pagename);
>
> I haven't tested this to see if it will work with charsets other than
> ISO-8859-1 (Western European), but I don't see any real reason why it
> won't.  Time will tell.  :-)
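
For readers who do not use PHP, the same heuristic can be sketched in Python. This is an approximation, not the PmWiki implementation: `normalize_pagename` is a hypothetical name, it takes the already-unescaped bytes, and it returns a Python string where PHP's `utf8_decode` returns ISO-8859-1 bytes. Invalid UTF-8 is replaced rather than rejected, which is an assumption of this sketch.

```python
import re
from urllib.parse import unquote_to_bytes

# Any byte in 0x80..0xBF: a UTF-8 continuation byte, which the message
# treats as never being a letter in ISO-8859-1.
CONTINUATION_BYTE = re.compile(rb"[\x80-\xbf]")

def normalize_pagename(raw: bytes) -> str:
    """Guess the encoding of a page name with the heuristic described above."""
    if CONTINUATION_BYTE.search(raw):
        # At least one continuation byte is present: treat the name as UTF-8.
        return raw.decode("utf-8", errors="replace")
    # Otherwise assume the browser sent single-byte ISO-8859-1.
    return raw.decode("iso-8859-1")

# Both browser behaviours from the message resolve to the same page name.
from_latin1 = normalize_pagename(unquote_to_bytes("%d4lh%e0Co%eftAmbig%fc"))
from_utf8 = normalize_pagename(unquote_to_bytes("%c3%94lh%c3%a0Co%c3%aftAmbig%c3%bc"))
print(from_latin1 == from_utf8)  # True
```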
>
> Pm
>