[Pmwiki-users] Help with PHP regexp

Patrick R. Michaud pmichaud
Mon Jan 5 20:05:01 CST 2004


On Mon, Jan 05, 2004 at 11:39:01AM -0600, John Feezell wrote:
> I recently began studying PHP regular expressions so that I could use them 
> with PmWiki and FTS.  I have material from the PHP manual but would like to 
> know how others on the list have gained knowledge of these - websites, 
> books, etc.

Practice, and just playing with them. 

> It would be helpful to see an analysis of one or two of them as they relate 
> to PmWiki.

Gladly!  PmWiki is largely based on regular expression matching.  In fact,
I've often thought that I could potentially write PmWiki's text processing
engine as a sequence of regular expression match/replacement actions, but
decided that was a bad idea (feels too much like Sendmail's configuration...)

I'll explain each of the patterns below as best I can...

> For example I'm studying the following from PmWiki.php
> $GroupNamePattern="[A-Z][A-Za-z0-9]+";

A wiki group name starts with an uppercase letter and is followed by one or
more letters or digits.
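
A minimal sketch of how this pattern could be tried out on its own. The
helper name isGroupName is hypothetical (it is not part of PmWiki), and
anchoring the pattern with ^ and $ is an assumption made here so that the
whole string has to match:

```php
<?php
// Pattern copied from the message above.
$GroupNamePattern = "[A-Z][A-Za-z0-9]+";

// Hypothetical helper (not part of PmWiki): true if $name as a whole
// is a valid group name.
function isGroupName($name, $pattern) {
    return preg_match('/^(?:' . $pattern . ')$/', $name) === 1;
}

var_dump(isGroupName("Main", $GroupNamePattern));    // bool(true)
var_dump(isGroupName("main", $GroupNamePattern));    // bool(false): lowercase first letter
var_dump(isGroupName("M", $GroupNamePattern));       // bool(false): needs at least two characters
var_dump(isGroupName("Pm-Wiki", $GroupNamePattern)); // bool(false): hyphen not allowed
```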

> $WikiWordPattern="[A-Z][A-Za-z0-9]*(?:[A-Z][a-z0-9]|[a-z0-9][A-Z])[A-Za-z0-9]*";

A bit more complex.  Essentially this pattern says that a WikiWord has to
begin with an uppercase letter, and must have at least one more uppercase
letter and one lowercase letter or digit (in any order).  The ?: after the
opening parenthesis says that the parens are for grouping only and are not
a capturing subpattern.  The part within the parens matches an uppercase
letter followed by a lowercase letter or digit, or vice-versa.
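
The rule above can be checked with a short sketch. The helper names
isWikiWord and findWikiWords are hypothetical (not part of PmWiki), and
the ^...$ anchors and \b word boundaries are assumptions of this example
rather than the way PmWiki itself applies the pattern:

```php
<?php
// Pattern copied from the message above (joined onto one line).
$WikiWordPattern = "[A-Z][A-Za-z0-9]*(?:[A-Z][a-z0-9]|[a-z0-9][A-Z])[A-Za-z0-9]*";

// Hypothetical helper: true if $word as a whole is a WikiWord.
function isWikiWord($word, $pattern) {
    return preg_match('/^(?:' . $pattern . ')$/', $word) === 1;
}

var_dump(isWikiWord("WikiWord", $WikiWordPattern)); // bool(true)
var_dump(isWikiWord("A1B", $WikiWordPattern));      // bool(true): digit next to an uppercase letter
var_dump(isWikiWord("Wikiword", $WikiWordPattern)); // bool(false): no second uppercase letter
var_dump(isWikiWord("WIKI", $WikiWordPattern));     // bool(false): no lowercase letter or digit

// Hypothetical helper: collect every WikiWord found in a piece of text.
function findWikiWords($text, $pattern) {
    preg_match_all('/\b' . $pattern . '\b/', $text, $m);
    return $m[0];
}

// Finds PmWiki and WikiSandbox; "See" has no second uppercase letter.
print_r(findWikiWords("See PmWiki and the WikiSandbox page.", $WikiWordPattern));
```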

> $FreeLinkPattern="{{(?>([A-Za-z][A-Za-z0-9]*(?:(?:[\\s_]*|-)[A-Za-z0-9]+)*)(?:\\|((?:(?:[\\s_]*|-)[A-Za-z0-9])*))?)}}((?:-?[A-Za-z0-9]+)*)";

This is probably the most difficult pattern in PmWiki--it took me a while
to build this one.  I'll take out some of the optimizing paren constructs
to explain it.  A freelink consists of 
   two curly braces,                          {{
   followed by a word,                        [A-Za-z][A-Za-z0-9]* 
   followed by zero or more words 
     delimited by whitespace, underscores,
     or single hyphens,                       (([\\s_]*|-)[A-Za-z0-9]+)*
   optionally followed by a vertical bar
     and zero or more words delimited by
     whitespace, underscores, or single
     hyphens,                                 (\\|((([\\s_]*|-)[A-Za-z0-9])*))?
   followed by two curly braces,              }}
   optionally followed by a suffix of
     letters, digits, or single hyphens.      (-?[A-Za-z0-9]+)*

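Again, the ?: after a paren indicates a non-capturing subpattern, and
the ?> after the first parenthesis makes that group a once-only (atomic)
subpattern: once it has matched, the regex engine won't backtrack into
it, which helps to optimize the regex match.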
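
To see which parts of a freelink end up in which capturing group, here is
a rough sketch using the full pattern. The helper name parseFreeLink and
the array keys are hypothetical (not part of PmWiki); they only label the
three capturing groups by position:

```php
<?php
// Pattern copied from the message above (joined onto one line).
$FreeLinkPattern = "{{(?>([A-Za-z][A-Za-z0-9]*(?:(?:[\\s_]*|-)[A-Za-z0-9]+)*)(?:\\|((?:(?:[\\s_]*|-)[A-Za-z0-9])*))?)}}((?:-?[A-Za-z0-9]+)*)";

// Hypothetical helper: return the three captured parts of the first
// freelink in $text, or null when $text contains no freelink.
function parseFreeLink($text, $pattern) {
    if (preg_match('/' . $pattern . '/', $text, $m) !== 1) {
        return null;
    }
    return array(
        'before_bar' => $m[1], // words before the vertical bar
        'after_bar'  => $m[2], // words after the vertical bar ("" if no bar)
        'suffix'     => $m[3], // letters/digits right after the closing braces
    );
}

// before_bar = "free link", after_bar = "", suffix = ""
print_r(parseFreeLink("{{free link}}", $FreeLinkPattern));

// before_bar = "wiki sandbox", after_bar = "play area", suffix = "s"
print_r(parseFreeLink("see {{wiki sandbox|play area}}s here", $FreeLinkPattern));

// A doubled hyphen is not a valid word delimiter, so nothing matches.
var_dump(parseFreeLink("{{two--hyphens}}", $FreeLinkPattern)); // NULL
```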
   
> $FragmentPattern="#[A-Za-z][-.:\\w]*";

A simple one--a link fragment consists of a '#', followed by a letter,
followed by any sequence of hyphens, dots, colons, or word characters
(\w matches letters, digits, and underscores).
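
A small sketch for experimenting with this one. The helper name
isFragment is hypothetical (not part of PmWiki), and the ^...$ anchors
are an assumption added here so the whole string must match:

```php
<?php
// Pattern copied from the message above.
$FragmentPattern = "#[A-Za-z][-.:\\w]*";

// Hypothetical helper: true if $s as a whole is a link fragment.
function isFragment($s, $pattern) {
    return preg_match('/^(?:' . $pattern . ')$/', $s) === 1;
}

var_dump(isFragment("#top", $FragmentPattern));         // bool(true)
var_dump(isFragment("#sec-1.2:a_b", $FragmentPattern)); // bool(true): \w also allows underscores
var_dump(isFragment("#1abc", $FragmentPattern));        // bool(false): must start with a letter
var_dump(isFragment("top", $FragmentPattern));          // bool(false): missing '#'
```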

> $PageTitlePattern="[A-Z][A-Za-z0-9]*(?:-[A-Za-z0-9]+)*";

A page title starts with an uppercase letter followed by any number of
letters or digits, and can continue with more words, each preceded by a
single hyphen.
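
The same kind of quick check works here. Again a sketch only: the helper
name isPageTitle is hypothetical (not part of PmWiki), and the ^...$
anchors are an assumption of the example:

```php
<?php
// Pattern copied from the message above.
$PageTitlePattern = "[A-Z][A-Za-z0-9]*(?:-[A-Za-z0-9]+)*";

// Hypothetical helper: true if $s as a whole is a page title.
function isPageTitle($s, $pattern) {
    return preg_match('/^(?:' . $pattern . ')$/', $s) === 1;
}

var_dump(isPageTitle("HomePage", $PageTitlePattern));           // bool(true)
var_dump(isPageTitle("Release-Notes-2004", $PageTitlePattern)); // bool(true): single hyphens between words
var_dump(isPageTitle("home", $PageTitlePattern));               // bool(false): lowercase first letter
var_dump(isPageTitle("Two--Hyphens", $PageTitlePattern));       // bool(false): doubled hyphen
var_dump(isPageTitle("Trailing-", $PageTitlePattern));          // bool(false): hyphen needs a word after it
```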

> $UrlPathPattern="[^\\s<>[\\]\"\'()]*[^\\s<>[\\]\"\'(),.?]";

The path component of a URL contains any character EXCEPT whitespace,
angle brackets <>, square brackets [], quotation marks "', or parentheses.
In addition, a URL doesn't end in a comma, period, or question mark.
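
The effect of the final character class is easiest to see on running
text. In this sketch the helper name findUrl is hypothetical, and the
literal "http:" prefix in front of the path pattern is an assumption
made for the example (it is not taken from PmWiki's code):

```php
<?php
// Pattern copied from the message above.
$UrlPathPattern = "[^\\s<>[\\]\"\'()]*[^\\s<>[\\]\"\'(),.?]";

// Hypothetical helper: return the first http: URL in $text, or null.
function findUrl($text, $pathPattern) {
    if (preg_match('/http:' . $pathPattern . '/', $text, $m) !== 1) {
        return null;
    }
    return $m[0];
}

// The comma after the URL is left out; dots inside the URL are kept.
echo findUrl("See http://www.example.com/wiki/Main.HomePage, or ask.", $UrlPathPattern), "\n";
// prints http://www.example.com/wiki/Main.HomePage

// The closing parenthesis and the period before it are both left out.
echo findUrl("(try http://example.com/a?b=1.)", $UrlPathPattern), "\n";
// prints http://example.com/a?b=1
```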

Questions and comments welcomed.

Pm


