[Pmwiki-users] accented characters

Patrick R. Michaud pmichaud
Sun Feb 1 09:06:03 CST 2004


On Sat, Jan 31, 2004 at 11:09:14PM +0100, Jean-Claude wrote:
> Congratulations, Patrick, you made my browser work!
> 
> For my own edification, would you share part of the "trick"?

Sure, although it gets fairly technical and may require some background.
If you want to skip the background and details, just jump to the bottom
paragraph that says "Solution:".

The problem has to do with differences in the way that browsers
encode non-ASCII characters in URIs.  RFC 2396 says that a string
such as "ÔlhàCoïtAmbigø" should be encoded in a 7-bit compatible form
as "%d4lh%e0Co%eftAmbig%f8", where the %d4, %e0, %ef, and %f8
are the Ô, à, ï, and ø characters.  This is what Netscape and other
browsers do.
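
As an illustration (in Python rather than PHP, since that's easier to
sketch here; the page name is just the example from above):

```python
from urllib.parse import quote

# Percent-encode the example page name using its ISO-8859-1 bytes,
# as RFC 2396-era browsers such as Netscape did.
name = "\u00d4lh\u00e0Co\u00eftAmbig\u00f8"   # "ÔlhàCoïtAmbigø"
print(quote(name, encoding="latin-1"))
# -> %D4lh%E0Co%EFtAmbig%F8  (hex case is insignificant in URIs)
```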

However, it didn't take long for people to recognize that
"%d4lh%e0Co%eftAmbig%fc" is a really ugly URI and difficult to
accurately transcribe from other sources (e.g., over the phone
or printed in a book).  And, in some languages the characters used
are actually multi-byte sequences.  So, RFC 2718 proposes to base 
URIs on UTF-8, which is a mechanism for encoding characters above 
the range of US-ASCII while preserving the meanings of ASCII characters.  
Thus, a browser using UTF-8 encodes "ÔlhàCoïtAmbigø" as
"%c3%94lh%c3%a0Co%c3%aftAmbig%c3%b8" when it sends the URI to the server,
with the Ô, à, ï, and ø characters encoded as %c3%94, %c3%a0, %c3%af,
and %c3%b8, and the name appears as "Ã”lhÃ CoÃ¯tAmbigÃ¸" when those
bytes are displayed as ISO-8859-1.  This is what many versions of
Internet Explorer do, and it is also a W3C recommendation.
(To compound the problem slightly, IE users can use "Internet Options"
to change IE's UTF-8 behavior. :-)
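
The same name under UTF-8 encoding, again sketched in Python:

```python
from urllib.parse import quote

# Percent-encode the same page name using its UTF-8 bytes, as a
# UTF-8-aware browser does: each non-ASCII letter becomes two bytes.
name = "\u00d4lh\u00e0Co\u00eftAmbig\u00f8"   # "ÔlhàCoïtAmbigø"
print(quote(name, encoding="utf-8"))
# -> %C3%94lh%C3%A0Co%C3%AFtAmbig%C3%B8
```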

Thus, when a server program receives a URI from a browser containing
a sequence such as "%c3%a0", how is it to know whether this represents
the UTF-8 character "à" or the ISO-8859-1 two-character string
"Ã" (%c3) followed by a non-breaking space (%a0)?  It's just
coincidence that all of the sequences in this example begin with
%c3--UTF-8 encodings can begin with anything from %c2 to %f7, which
includes most of the ISO-8859-1 letters.  (Note that here I'm using
the term "letter" in an internationalized sense; Ô, à, ï, and ø are
all "letters" but don't fall in the US-ASCII range A-Za-z.)
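
The ambiguity is easy to reproduce directly; a small Python illustration:

```python
# The two raw bytes behind the "%c3%a0" sequence:
raw = bytes([0xC3, 0xA0])

print(raw.decode("utf-8"))     # -> "à"       (one character)
print(raw.decode("latin-1"))   # -> "Ã" + NBSP (two characters)
```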

In the general case of receiving character strings from browsers, 
there's no reliable way of knowing what was intended, because
character-encoding information isn't yet part of a request.  So,
there's no generic solution to the problem yet.

However, in our specific application, we know that we're looking for
characters that are valid wiki page names--i.e., (internationalized)
letters and digits.  It turns out that another characteristic of
UTF-8 encodings is that the second and subsequent bytes of non-ASCII
characters must all fall in the range %80 to %bf.  What's more, none
of these bytes represent letters in the ISO-8859-1 character set.
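
That non-overlap can be checked mechanically.  A Python sketch (the
letter set below is my own enumeration of the ISO-8859-1 accented
letters, i.e. %c0 through %ff minus the × and ÷ signs):

```python
# ISO-8859-1 accented letters occupy 0xC0-0xFF (except 0xD7 '×' and
# 0xF7 '÷'); UTF-8 continuation bytes occupy 0x80-0xBF.  The two
# ranges never overlap.
latin1_letters = {b for b in range(0xC0, 0x100) if b not in (0xD7, 0xF7)}
continuation = set(range(0x80, 0xC0))
print(latin1_letters & continuation)   # -> set()
```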

===Solution===
Thus, after days of sometimes frustrating exploration and effort over 
the past six months, the solution for resolving internationalized wiki 
page names appears to be maddeningly simple:  All non-ASCII characters 
in UTF-8 contain at least one byte in the range %80 to %bf, and none 
of these bytes are letters in ISO-8859-1.  So, if a browser sends a 
page name containing a byte in the range %80 to %bf, then the name 
needs to be UTF-8 decoded.  In PHP I solved it with two lines of code:

   if (preg_match('/[\\x80-\\xbf]/',$pagename)) 
       $pagename=utf8_decode($pagename);

I haven't tested this to see if it will work with charsets other than
ISO-8859-1 (Western European), but I don't see any real reason why it
won't.  Time will tell.  :-)
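
For readers who prefer Python to PHP, the same heuristic might be
sketched like this (normalize_pagename is a hypothetical helper, not
part of PmWiki):

```python
import re

def normalize_pagename(pagename: bytes) -> str:
    """Apply the PHP snippet's heuristic: any byte in the UTF-8
    continuation range 0x80-0xbf means the name is UTF-8 encoded;
    otherwise treat it as ISO-8859-1."""
    if re.search(rb"[\x80-\xbf]", pagename):
        return pagename.decode("utf-8")
    return pagename.decode("latin-1")

print(normalize_pagename(b"\xc3\xa0"))   # UTF-8 bytes from IE        -> "à"
print(normalize_pagename(b"\xe0"))       # Latin-1 byte from Netscape -> "à"
```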

Pm


