[pmwiki-users] Defaulting PmWiki to utf8

Thu Nov 15 18:18:24 CST 2007

On Nov 15, 2007 3:40 PM, Patrick R. Michaud <pmichaud at pobox.com> wrote:
> On Thu, Nov 15, 2007 at 12:35:27AM -0500, sti at pooq.com wrote:
> > >> Since the mapping might be irreversible (as is the case in my
> > >> French example) one would want to store the canonical name as
> > >> an attribute in the page, for use in things like {$FullName}.
> > >
> > > PmWiki already stores the canonical name in a page file, as the name=
> > > parameter.  A big downside to this approach is that PmWiki cannot
> > > then use a simple directory listing to retrieve the set of pagenames
> > > on a site -- it must actually read the file to determine the
> > > true name.  That slows down a number of operations.
> > >
> > > Perhaps this could be reduced somewhat by placing a special
> > > extension or marker in the names of pagefiles when the name
> > > is a mapped form of the canonical one.  Then, when building
> > > a list of files PmWiki only has to read the files having the
> > > marker, instead of reading all of them.
> >
> > [...] I can also imagine a .pageindex-like scheme whereby an index file
> > could hold the canonical names of pages that have their filenames munged. If
> > an entry isn't in the index, the file in question is opened and the canonical
> > name is added.
>
> I fear a couple of issues with a .pageindex-like scheme -- primarily,
> there's the question of when/how to access it.  Scanning an index file
> on every PageRead() is likely to be too slow.  If we try to read the
> index once and cache all of the mappings in memory, we may end up
> with memory issues for sites that are running with PHP memory limits
> set.
>
> But, scanning an index file only when needing to generate a list
> of all pages might work out very nicely.  I'll have to think about
> that a bit.  In general working with indexes is a pain -- but this would
> be one advantage of using a database storage mechanism, of course.
>
> Also, if we're using an index sort of approach, it might make sense to
> get the existing .pageindex file to serve double-duty here, and move
> indexing into the page-handling functions instead of in the pagelist.php
> script.  But this is all sounding more and more like the post-2.2.0
> redesigns identified in the RoadMap.
>
> So, perhaps the correct baby step is to switch PmWiki to using utf8
> by default via its present mechanisms (i.e., without name mappings),
> and then add name mapping features as a post-2.2.0 improvement.
> Folks who prefer the somewhat nicer encodings for pagenames (i.e.,
> %e7 instead of %c3%a7) will still have the option of selecting
> iso-8859-1 for their systems.
>
> It's also worth noting that Wikipedia seems to follow the %c3%a7
> convention, and this is also the RFC 3986 standard.
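
To make the two encodings mentioned above concrete, here is a minimal
Python sketch (not PmWiki code; the helper name encode_name is only an
illustration) that percent-encodes the same page name under each
charset. Python's quote() emits uppercase hex digits, which are
equivalent to the lowercase forms quoted above.

```python
from urllib.parse import quote


def encode_name(name, charset):
    """Percent-encode a page name using the given charset (hypothetical helper)."""
    return quote(name, encoding=charset)


# The same page name under the two charsets discussed in the thread.
print(encode_name("Français", "iso-8859-1"))  # Fran%E7ais
print(encode_name("Français", "utf-8"))       # Fran%C3%A7ais
```

The single byte e7 is the iso-8859-1 code for "ç", while c3 a7 is its
two-byte UTF-8 form, which is why the UTF-8 URL is longer.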

Maybe I've missed something, but I don't see any reason for such a
complicated solution. I read this thread from the beginning and found
that it started here:

"Now, French computer users are used to seeing URLs with the accents
dropped, so http://www.example.com/Lang/Francais would be considered
an acceptable URL, although not as acceptable as a page name.".

Why not? We work with accented characters a lot, and we already use an
existing "mapping" for this: page titles. Because we want readable
URLs (such as http://www.example.com/Lang/Francais), we create the page
name in "deaccented" form and put the accented form in the page title.
Links to such a page take the form [[link title -> pagename]] or
[[pagename|+]]. Simple. No need for any other mapping, index, etc. So
the only thing that would make us happy is adding a "deaccentation"
call to the MakePageName function (I hope that's the function that
removes spaces and other unsupported characters and changes case).
That would let us write links without having to specify a link title.
What other benefits would the solution you are discussing bring?
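
As a rough illustration of the "deaccentation" step, here is a minimal
Python sketch. It is not PmWiki's or DokuWiki's code: deaccent and
make_page_name are hypothetical names, and make_page_name is only a
simplified stand-in for what I assume MakePageName does (drop
unsupported characters, capitalize each word).

```python
import re
import unicodedata


def deaccent(text):
    """Remove accents by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def make_page_name(title):
    """Hypothetical stand-in for MakePageName: deaccent, then join capitalized words.

    Letters with no canonical decomposition (e.g. "ß" or "ø") are not
    handled by deaccent and are simply dropped by the regex below; a real
    implementation would need an explicit mapping table for them.
    """
    words = re.findall(r"[A-Za-z0-9]+", deaccent(title))
    return "".join(word[:1].upper() + word[1:] for word in words)


print(deaccent("Français"))           # Francais
print(make_page_name("le français"))  # LeFrancais
```

With something like this in place, a link written with the accented
name would resolve to the deaccented page name, and the accented form
would still be shown through the page title.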

BTW, PmWiki could probably borrow some UTF-8 related functions from
DokuWiki, e.g. deaccentation (see
http://dev.splitbrain.org/view/darcs/dokuwiki/inc/utf8.php). Note
that they also do "romanization" for Cyrillic languages.

Roman