[Pmwiki-users] accented characters
Patrick R. Michaud
pmichaud
Sun Feb 1 09:06:03 CST 2004
On Sat, Jan 31, 2004 at 11:09:14PM +0100, Jean-Claude wrote:
> Congratulations, Patrick, you made my browser work!
>
> For my own education, would you share a part of the "trick"?
Sure, although it gets fairly technical, and some readers may need some background.
If you want to skip the background and details, just jump to the bottom
paragraph that says "Solution:".
The problem has to do with differences in the way that browsers
encode non-ASCII characters in URIs. RFC 2396 says that a string
such as "ÔlhàCoïtAmbigü" should be encoded in a 7-bit compatible form
as "%d4lh%e0Co%eftAmbig%fc", where the %d4, %e0, %ef, and %fc
are the Ô, à, ï, and ü characters. This is what Netscape and other
browsers do.
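The Latin-1 percent-encoding above can be reproduced with a few lines of
Python (used here purely for illustration; it is not PmWiki code):

```python
from urllib.parse import quote

# Percent-encode a Latin-1 (ISO-8859-1) page name as RFC 2396 describes:
# each non-ASCII byte becomes a single %XX escape (uppercase hex here,
# but hex case is not significant in URIs).
name = "\u00d4lh\u00e0Co\u00eftAmbig\u00fc"  # "ÔlhàCoïtAmbigü"
print(quote(name, encoding="latin-1"))       # %D4lh%E0Co%EFtAmbig%FC
```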
However, it didn't take long for people to recognize that
"%d4lh%e0Co%eftAmbig%fc" is a really ugly URI and difficult to
accurately transcribe from other sources (e.g., over the phone
or printed in a book). And, in some languages the characters used
are actually multi-byte sequences. So, RFC 2718 proposes to base
URIs on UTF-8, which is a mechanism for encoding characters above
the range of US-ASCII while preserving the meanings of ASCII characters.
Thus, a browser using UTF-8 encodes "ÔlhàCoïtAmbigü" as
"%c3%94lh%c3%a0Co%c3%aftAmbig%c3%bc" when it sends the URI to the server,
with the Ô, à, ï, and ü characters encoded as %c3%94, %c3%a0, %c3%af,
and %c3%bc, and a server that treats those bytes as Latin-1 sees the
name as "Ã”lhÃ CoÃ¯tAmbigÃ¼". This is what many versions of Internet
Explorer do, and it's also what the W3C recommends.
(To compound the problem slightly, IE users can use "Internet Options"
to change IE's UTF-8 behavior. :-)
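The UTF-8 behavior can be illustrated the same way (again a Python sketch,
not PmWiki code):

```python
from urllib.parse import quote

# Percent-encode the same page name, but from its UTF-8 bytes:
# each accented letter becomes a two-byte %XX%XX sequence.
name = "\u00d4lh\u00e0Co\u00eftAmbig\u00fc"  # "ÔlhàCoïtAmbigü"
print(quote(name, encoding="utf-8"))         # %C3%94lh%C3%A0Co%C3%AFtAmbig%C3%BC
```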
Thus, when a server program receives a URI from a browser containing
a sequence such as "%c3%a0" in it, how is it to know if this represents
the UTF-8 character "à" versus the ISO-8859-1 two-character string
of "Ã" (%c3) followed by a non-breaking space (%a0)? It's just
coincidence that all of the sequences in this example begin with
%c3--UTF-8 encodings can begin with anything from %c2 to %f7, which
includes most of the ISO-8859-1 letters. (Note that here I'm using
the term "letter" in an internationalized sense; Ô, à, ï, and ü are
all "letters" but don't fall in the US-ASCII range A-Za-z.)
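The ambiguity is easy to see by decoding the same two bytes both ways
(a Python illustration):

```python
# The two bytes behind "%c3%a0" decode perfectly well either way:
raw = bytes([0xC3, 0xA0])
as_utf8 = raw.decode("utf-8")      # one character: 'à'
as_latin1 = raw.decode("latin-1")  # two characters: 'Ã' + non-breaking space
assert as_utf8 == "\u00e0"
assert as_latin1 == "\u00c3\u00a0"
```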
In the general case of receiving character strings from browsers,
there's no reliable way of knowing what was intended, because
character-encoding information isn't yet part of a request. So,
there's no generic solution to the problem yet.
However, in our specific application, we know that we're looking for
characters that are valid wiki page names--i.e., (internationalized)
letters and digits. It turns out that another characteristic of
UTF-8 encodings is that the second and subsequent bytes of non-ASCII
characters must all fall in the range %80 to %bf. What's more, none
of these bytes represent letters in the ISO-8859-1 character set.
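That byte-range property can be checked mechanically (Python again, purely
to illustrate the ranges involved):

```python
# Every multi-byte UTF-8 sequence starts with a lead byte >= 0xC2 and
# continues only with bytes in the continuation range 0x80-0xBF.
for ch in ["\u00d4", "\u00e0", "\u00ef", "\u00fc", "\u20ac"]:  # Ô à ï ü €
    first, *rest = ch.encode("utf-8")
    assert first >= 0xC2
    assert rest and all(0x80 <= b <= 0xBF for b in rest)
print("all continuation bytes fall in %80-%bf")
```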
===Solution===
Thus, after days of sometimes frustrating exploration and effort over
the past six months, the solution for resolving internationalized wiki
page names appears to be maddeningly simple: All non-ASCII characters
in UTF-8 contain at least one byte in the range %80 to %bf, and none
of these bytes are letters in ISO-8859-1. So, if a browser sends a
page name containing a byte in the range %80 to %bf, then the name
needs to be UTF-8 decoded. In PHP I solved it with two lines of code:
    if (preg_match('/[\\x80-\\xbf]/', $pagename))
      $pagename = utf8_decode($pagename);
I haven't tested this to see if it will work with charsets other than
ISO-8859-1 (Western European), but I don't see any real reason why it
won't. Time will tell. :-)
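For readers more comfortable outside PHP, the same two-line heuristic might
be sketched in Python like this (a hypothetical port, not PmWiki's actual
code; note that PHP's utf8_decode converts UTF-8 input down to Latin-1):

```python
import re

def normalize_pagename(raw: bytes) -> str:
    """Hypothetical port of the PHP heuristic: any byte in the UTF-8
    continuation range %80-%bf means the name must be UTF-8-encoded;
    otherwise the bytes are treated as ISO-8859-1."""
    if re.search(rb"[\x80-\xbf]", raw):
        return raw.decode("utf-8")
    return raw.decode("latin-1")

# A UTF-8 browser sends "%c3%a0Co%c3%aft" -> bytes C3 A0 43 6F C3 AF 74
assert normalize_pagename(b"\xc3\xa0Co\xc3\xaft") == "\u00e0Co\u00eft"
# A Latin-1 browser sends "%e0Co%eft" -> bytes E0 43 6F EF 74
assert normalize_pagename(b"\xe0Co\xeft") == "\u00e0Co\u00eft"
```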
Pm