[pmwiki-users] Group and page name aren't recognized by pmwiki.php

Joachim Durchholz jo at durchholz.org
Wed Mar 23 14:33:05 CST 2005


Patrick R. Michaud wrote:
> On Wed, Mar 23, 2005 at 12:44:07PM +0100, Joachim Durchholz wrote:
> 
>>>>The .htaccess file looks like this:
>>>> RewriteEngine on
>>>> RewriteCond %{REQUEST_URI} !^/*(pmwiki\.php|pub)/+.*$
>>>> RewriteRule ^(.*)$ pmwiki.php/$1 [L,NS]
>>>
>>>This last RewriteRule is bad, because it's going to try to redirect
>>>everything to pmwiki.  That's probably not what you want.  Most
>>>of the CleanUrls recipes take advantage of the fact that group
>>>names begin with an uppercase letter.
>>
>>Well, this construction is quite exactly what I want, since Group names 
>>can begin with other things than just A-Z (for example, they can start 
>>with umlauts). I think the recipes are often naive in this regard.
> 
> Assuming ISO-8859-1, one can always fix the RewriteRule to work with
> leading accented characters by using
> 
>    RewriteRule ^([A-ZÀ-Þ].*)$ pmwiki.php?n=$1 [L,qsappend]

Hmm... a bit difficult to explain on a Cookbook page. It would require 
going into a lot of details (including the nontrivial question of how to 
enter that 'Þ' character).
Also there's the question whether PmWiki is smart enough to provide 
uppercase equivalents for all lower-case characters. (That's a 
nontrivial transformation actually: German 'ß' would have to be 
transformed to 'Sz' or 'Ss', depending on personal preference... and 
that means that there's a possibility that transformed names may clash 
where untransformed ones didn't, which has a potential for further 
semantic confusion. Not that any German word would start with 'ß', but a 
wiki is about enabling people to invent stuff, so there...)

>> For this reason, I consider it safer to exclude what I know must be
>> excluded (i.e. pmwiki.php and pub, and probably the uploads
>> directory as well), and consider everything else a group/page name.
> 
> You probably also want to exclude robots.txt, and perhaps favicon.ico.
> In general, it's better to specify what you want to include as opposed
> to what you want to exclude, but every site has its own needs.

I have a different idea: anything that starts with a lower-case 7-bit 
ASCII character should be considered a URL that should be served from 
the file system, everything else should be served via PmWiki, giving me 
this rule:

RewriteCond %{REQUEST_URI} !^/*[a-z]
RewriteRule ^(.*)$ pmwiki.php/$1 [L,NS]

The nice thing about this rule is that it separates URL space quite 
cleanly into two distinct areas: filesystem space (pub, pmwiki.php, 
favicon.ico etc.) and Wiki space (Home, Änderungsbedarf, Égalité, 
égalité, etc.).
Even nicer is that it's very, very likely that evolving web standards or 
PmWiki directory layouts will all stick with lower-case 7-bit ASCII, so 
there's virtually no danger that this distinction will ever break.
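
To illustrate the split, here is a minimal PHP sketch that applies the same pattern as the RewriteCond to a few sample paths. The function name goes_to_wiki() and the sample paths are made up for illustration; this is not part of PmWiki or Apache, and it only models the condition, not the rewrite itself.

```php
<?php
// Hypothetical helper mirroring: RewriteCond %{REQUEST_URI} !^/*[a-z]
// Returns true when the request would be handed to pmwiki.php.
function goes_to_wiki(string $requestUri): bool
{
    // Optional leading slashes followed by a lower-case 7-bit ASCII
    // letter means "serve from the file system"; anything else is wiki.
    return preg_match('!^/*[a-z]!', $requestUri) !== 1;
}

// Sample paths (assumed for illustration only).
$samples = ['/pub/skins/x.css', '/favicon.ico', '/Home',
            '/Änderungsbedarf', '/égalité'];
foreach ($samples as $uri) {
    echo $uri, ' => ', goes_to_wiki($uri) ? 'wiki' : 'filesystem', "\n";
}
```

The pattern is matched byte-wise without any case-insensitive flag, so names starting with an accented letter (or with its percent-encoded form, which starts with '%') never match [a-z] and end up in Wiki space.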

> I don't know if Apache has support for named character classes; if so
> then one could use
> 
>    RewriteRule ^([[:upper:]].*)$ pmwiki.php?n=$1 [L,qsappend]
> 
> but this didn't work on my server.

Maybe only Apache 2.0.

The Apache 1.3 docs remain silent on the type of regex engine used. 
Apache 2.0 claims it's using Perl regexes and refers to the Perl regex 
page, which in turn lists character classes (and even gives tantalizing 
hints on UTF-8 support). Alas, there are still too many Apache 1.3 
installations around, so PmWiki can't make use of all these nice features...

>>I also made some disturbing observations about the consistency of syntax usage in 
>>PmWiki itself and in skins. At least those skins that I have downloaded 
>>use PathInfo syntax for links on the template page (so these skins would 
>>break on a query syntax-only site); 
> 
> Wrong--they absolutely *don't* break.  PmWiki internally converts 
> PathInfo-style urls to query-syntax urls whenever $EnablePathInfo 
> is not set.

This implies that PmWiki always recognizes PathInfo-style URLs even if 
$EnablePathInfo is 0.

Is this correct?

>>> If there's a reason why the simple recipe in Cookbook.CleanUrls 
>>> won't work, let me know and we'll go from there.
>> 
>> Yup, I found it. It's this code in pmwiki.php, near line 225:
>> 
>>if (!$pagename &&
>>  preg_match('!^'.preg_quote($_SERVER['SCRIPT_NAME'],'!').'/*([^?]*)!',
>>    $_SERVER['REQUEST_URI'],$match))
>>  $pagename = urldecode($match[1]);
>>
>> The first parameter of preg_match evaluates to a regex that matches
>> something like '/path/to/wiki/pmwiki.php' (with an optional
>> appended '/Group/Page' and/or a '?QueryString'); however, rewriting
>> just elided the script name, so it doesn't match.
> 
> Ah, I see.  And that's because you were using the PathInfo syntax
> in the mod_rewrite.  Using the query-string syntax means that this
> code never needs executing (because $pagename is already set).

This matches my analysis.
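
For the record, the mismatch can be reproduced outside PmWiki. The following is a minimal sketch with assumed values: SCRIPT_NAME is taken to be '/pmwiki.php' and the two URIs are invented samples; a real server may report different values.

```php
<?php
// Assumed value; on a real server this comes from $_SERVER['SCRIPT_NAME'].
$scriptName = '/pmwiki.php';

// Same pattern construction as the pmwiki.php snippet quoted above.
$pattern = '!^' . preg_quote($scriptName, '!') . '/*([^?]*)!';

// Direct request: the script name is part of the URI, so the pattern
// matches and the group/page name lands in $m1[1].
$direct = preg_match($pattern, '/pmwiki.php/Main/HomePage?action=edit', $m1);

// Rewritten request: REQUEST_URI still holds the URL the browser sent,
// which no longer contains the script name, so nothing matches.
$rewritten = preg_match($pattern, '/Main/HomePage?action=edit', $m2);

var_dump($direct, $m1[1] ?? null, $rewritten);
```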

>> Doing some case analysis of Apache's behaviour and googling for the
>> CGI specification however showed me that there's an environment
>> variable that contains just what PmWiki needs: PATH_INFO.
>> [...]
>> However, looking at the CGI spec revealed that PATH_INFO is
>> required to have precisely the value that PmWiki wants (modulo the
>> initial slash that needs to be stripped off).
>> So I'm baffled: why doesn't PmWiki use PATH_INFO?
> 
> Because it doesn't work on a lot of sites.  Don't believe every 
> specification you read

No I don't - that's why I asked :-)

> --although the CGI spec clearly says that
> this is what PATH_INFO is supposed to have, there are a *lot* of 
> webservers, including Apache and PHP, that do not follow this 
> part of the spec.  IIS is another one.

Hmm... I did a few experiments with Apache 1.3, and it worked perfectly. 
In fact I have patched my site to use PATH_INFO, and it's been doing the 
Right Thing all day.

PATH_INFO is not set if there is no path info after the script name. I.e. 
URLs like http://maquaris.de/pmwiki.php or (if the RewriteRule is in 
effect) http://maquaris.de do not have a PATH_INFO, but that's OK: there 
is no slash-separated stuff after the script name after all.

> Apache 2.0.30 introduced an AcceptPathInfo directive;

... and in direct violation of the CGI spec <sigh>...

> unfortunately
> the default setting on many servers/PHP environments is that urls
> with a PATH_INFO component result in a 404 Not Found error.  The only
> way to fix this is to set AcceptPathInfo On, which assumes that the
> wiki administrator has privileges to do so.  (For more details, see
> http://httpd.apache.org/docs-2.0/mod/core.html#acceptpathinfo ).

Oh, how I like that stuff. <Apache rant deleted...>

> Even for those sites that have a PATH_INFO variable set, it's often 
> *not* the value defined by the CGI specification.  If PHP is running
> in cgi-script mode, then PATH_INFO can end up being the entire url
> path (which is treated as an argument to the php executable), and not
> just the portion of the url that comes after what we think of as
> the script name.
> 
> Things get much worse in IIS.

As usual ;-}

> If you go back and search the archives you'll see that PmWiki has 
> always had a strong preference to using PATH_INFO, but I eventually 
> had to add an $EnablePathInfo option (default on) so that servers
> that didn't support PATH_INFO could still run using the query string.
> But this still left a lot of new admins wondering why PmWiki wasn't
> working on their site, so in 2.0.beta8 I changed the default for
> $EnablePathInfo to off, so that an initial install generally works
> everywhere and then wiki admins can easily try turning $EnablePathInfo
> on to test if it will work on their server.  Regardless, PmWiki knows
> how to convert between PATH_INFO urls and query string urls, and
> nearly always does the right thing depending on the value of 
> $EnablePathInfo.

Understood.

However, I think the current solution can be improved.

As far as I can tell, $EnablePathInfo influences just the generation of 
PmWiki URLs from PmWiki itself; it doesn't seem to influence parsing the 
incoming URL. (That's OK: it's very much in line with the policy that 
one should be liberal when accepting data and restrictive when sending it.)
If I'm interpreting the code correctly, PmWiki tries the following steps 
in order until one succeeds:
1) Get _REQUEST ['n']. (Filled from ?n=... by PHP.)
2) Get _REQUEST ['pagename'].
3) Parse REQUEST_URI, under the assumption that it consists of
_SERVER ['SCRIPT_NAME'], a slash, and the Group/Page name.

Step (3) actually is wrong: it shouldn't use SCRIPT_NAME, it should use 
$ScriptUrl from config.php. Unfortunately, this can't be fixed easily 
since config.php is called after the URL parse (so it can make decisions 
based on, say, what group we're in - e.g. to select a different skin).

However, step (3) is a useful heuristic, so it does have its place.

Now let me propose another heuristic:
*if* PATH_INFO is set, let PmWiki take the group/page name from there.
This should be done between steps (2) and (3).

Here's the code to do it:

if (!$pagename && isset($_SERVER['PATH_INFO']) &&
     preg_match('!^/*(.*)!', $_SERVER['PATH_INFO'], $match))
   $pagename = $match[1];

I intentionally didn't urldecode() the value - the CGI specs say that 
the string should be decoded by the server. (I hope that if a server 
fills PATH_INFO, it also honors that part of the CGI spec...)
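
To make the proposed lookup order concrete, here is a minimal, self-contained sketch of all four steps as one function. The name resolve_pagename() and its two array parameters are hypothetical and stand in for $_REQUEST and $_SERVER; pmwiki.php does this inline on globals, so this is a model of the order rather than PmWiki's actual code.

```php
<?php
// Hypothetical function: $request models $_REQUEST, $server models $_SERVER.
function resolve_pagename(array $request, array $server): string
{
    // Steps 1 and 2: explicit ?n=... or ?pagename=... wins.
    foreach (['n', 'pagename'] as $key) {
        if (!empty($request[$key])) {
            return $request[$key];
        }
    }
    // Proposed step: use PATH_INFO if the server set it. No urldecode()
    // here, on the assumption that the server already decoded it.
    if (!empty($server['PATH_INFO'])) {
        $name = ltrim($server['PATH_INFO'], '/');
        if ($name !== '') {
            return $name;
        }
    }
    // Step 3: fall back to stripping SCRIPT_NAME from REQUEST_URI.
    if (isset($server['SCRIPT_NAME'], $server['REQUEST_URI'])
        && preg_match('!^' . preg_quote($server['SCRIPT_NAME'], '!') . '/*([^?]*)!',
                      $server['REQUEST_URI'], $match)) {
        return urldecode($match[1]);
    }
    return ''; // nothing found; the caller picks a default page
}
```

With a rewritten request (PATH_INFO '/Main/HomePage', REQUEST_URI '/Main/HomePage') the PATH_INFO step supplies the name; on a server that sets no PATH_INFO, the function falls through to step (3) exactly as before.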

What do you think?

Regards,
Jo


