[pmwiki-users] UTF-8 as core default encoding (was: Headers are not sending charset !)

Patrick R. Michaud pmichaud at pobox.com
Mon Mar 12 09:03:30 CDT 2007


On Mon, Mar 12, 2007 at 09:22:02AM +0200, Athan wrote:
> As mentioned before, there are many other issues with utf-8, most of them
> with recipes. Given that most recipes use single-byte regexes,
> utf-8 issues are to be expected. However, I know that you cannot do
> anything about that, except maybe consider using utf-8 as the single
> default encoding for the pmwiki core.

The question of using utf-8 as the default encoding in the core
comes up a fair bit (e.g., in PITS 00682), so let me answer it 
directly.

Yes, I agree that utf-8 is now the preferred encoding for web pages,
even for English and Western European languages.  However, that wasn't
the case when PmWiki was created, which is partially why PmWiki has 
traditionally defaulted to iso-8859-1.  But more to the point, there 
are a variety of PHP functions and features that simply fail to work 
properly when utf-8 is the encoding being used.

The biggest limitation with utf-8 is that regular expression patterns 
can no longer use "[[:upper:]]" and "[[:lower:]]" to match non-ASCII
uppercase and lowercase characters in strings.  This used to be a serious
limitation when many sites were running with WikiWords enabled, because
PmWiki could not detect wikiwords in the markup text without these
patterns.  It's much less of an issue now that PmWiki ships with WikiWords
disabled by default... but it's still a bit of an issue.
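
As a rough illustration of why byte-oriented character classes break down
(written in Python rather than PHP, and not taken from PmWiki's code), the
sketch below hard-codes the iso-8859-1 uppercase and lowercase byte ranges
into a WikiWord-style pattern. The pattern, the helper name is_wikiword, and
the sample word are all assumptions made for the example.

```python
import re

# Byte ranges for letters in iso-8859-1 (assumed for this sketch).
UPPER = rb"A-Z\xc0-\xd6\xd8-\xde"
LOWER = rb"a-z\xdf-\xf6\xf8-\xff"

# Simplified WikiWord shape: Upper, lowers, Upper, lowers.
WIKIWORD = re.compile(
    rb"[" + UPPER + rb"][" + LOWER + rb"]+[" + UPPER + rb"][" + LOWER + rb"]+"
)


def is_wikiword(data: bytes) -> bool:
    """Return True if the whole byte string has the WikiWord shape."""
    return WIKIWORD.fullmatch(data) is not None


word = "\u00c9t\u00e9\u00c9cole"  # "EteEcole" with accented E's

# One byte per letter: the single-byte classes see each letter correctly.
latin1_match = is_wikiword(word.encode("latin-1"))

# Two bytes per accented letter: the second byte is in neither class.
utf8_match = is_wikiword(word.encode("utf-8"))

# Plain ASCII is encoded identically in both, so it is unaffected.
ascii_match = is_wikiword("WikiWord".encode("utf-8"))
```

Under these assumptions the iso-8859-1 bytes match and the utf-8 bytes do
not, while an ASCII-only WikiWord matches either way.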

Many case-insensitive functions cease to be case-insensitive for utf-8;
in particular, the '/i' flag to preg_match and preg_replace patterns
doesn't seem to work for non-ASCII letters.
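
The same effect can be sketched for case-insensitive comparison. The Python
example below (an illustration only, not PHP's implementation) folds case
one byte at a time using an iso-8859-1 table, which is roughly what a
single-byte case-insensitive routine amounts to. The names LATIN1_FOLD and
same_ignoring_case are made up for the example.

```python
def _fold_byte(i: int) -> int:
    """Lowercase one byte value as if it were an iso-8859-1 character."""
    low = chr(i).lower()
    return ord(low) if len(low) == 1 and ord(low) < 256 else i


# 256-entry translation table: byte value -> lowercased byte value.
LATIN1_FOLD = bytes(_fold_byte(i) for i in range(256))


def same_ignoring_case(a: bytes, b: bytes) -> bool:
    """Compare two byte strings after per-byte iso-8859-1 case folding."""
    return a.translate(LATIN1_FOLD) == b.translate(LATIN1_FOLD)


upper_word = "\u00c9COLE"   # accented capital E + "COLE"
lower_word = "\u00e9cole"   # accented small e + "cole"

# In iso-8859-1 each letter is one byte, so per-byte folding works.
latin1_equal = same_ignoring_case(
    upper_word.encode("latin-1"), lower_word.encode("latin-1")
)

# In utf-8 the accented letters differ in their second byte, which the
# per-byte table does not treat as a letter, so the comparison fails.
utf8_equal = same_ignoring_case(
    upper_word.encode("utf-8"), lower_word.encode("utf-8")
)
```

Here the two spellings compare equal as iso-8859-1 bytes but not as utf-8
bytes, even though the ASCII letters in them are folded correctly.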

Another limitation is that some locale-dependent output (e.g., date and
time strings returned by PHP's strftime() function) is produced in an
iso-8859-1 character set, and thus won't display properly if utf-8 is chosen.

Still another problem is dealing with non-ASCII characters in filenames; 
switching to a utf-8 encoding means that any existing pages or attachments 
with non-ASCII characters in their names will have to be fixed in order
to work properly.  And, at least on my systems (Linux), filenames with
iso-8859-1 encodings display properly, while utf-8 filenames appear garbled.
(I fully admit that for many people this situation is reversed, such that
utf-8 appears correct while iso-8859-1 appears garbled... my point is simply
that no matter what PmWiki does by default it is going to cause problems 
for some group of people.)
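
To make the filename problem concrete, here is a small Python sketch (an
illustration only; the page name is hypothetical) of what happens when the
bytes of a name written in one encoding are read back under the other.

```python
page_name = "Caf\u00e9"  # hypothetical page name, "Cafe" with an accented e

utf8_bytes = page_name.encode("utf-8")      # accented e becomes two bytes
latin1_bytes = page_name.encode("latin-1")  # accented e is a single byte

# utf-8 name shown by a tool that assumes iso-8859-1:
# each of the two bytes is displayed as its own character.
utf8_seen_as_latin1 = utf8_bytes.decode("latin-1")

# iso-8859-1 name shown by a tool that assumes utf-8:
# the lone byte is not valid utf-8, so it becomes a replacement character.
latin1_seen_as_utf8 = latin1_bytes.decode("utf-8", errors="replace")

# The two encodings also yield different byte strings, so a file stored
# under one name will not be found when looked up under the other.
names_differ = utf8_bytes != latin1_bytes
```

Either direction garbles the name, and since the stored bytes differ, any
existing file would need to be renamed after a change of encoding.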

So, in order to default to utf-8, we have to provide workarounds for
the things that don't work in PHP, and every workaround has the potential
to really slow down page rendering and other features.  Rather than
default to utf-8 and thus hit _every_ site with the workaround performance
penalty even when utf-8 isn't being used, PmWiki defaults to iso-8859-1
(where PHP works most efficiently) and lets those sites that need or
want utf-8 encoding do a simple include to get utf-8 to work.

This isn't to say that PmWiki will never switch to using a utf-8 encoding
by default... I'm only saying that there are a few large hurdles yet to be
overcome before we can do that.

Pm


