[pmwiki-users] Keep() function documented

Joachim Durchholz jo at durchholz.org
Wed Jul 6 04:00:00 CDT 2005


Patrick R. Michaud wrote:
> On Tue, Jul 05, 2005 at 11:34:01PM +0200, Joachim Durchholz wrote:
> 
>>> There's nothing to "prevent" someone from doing it, but such 
>>> nested keeps are generally not restored properly.  That's what I
>>> meant by "not legally nestable" -- someone can write a script 
>>> that does it, but they may not be entirely happy with the
>>> results.
>> 
>> Ah, I see.
>> 
>> Note that such nesting can easily happen if a markup calls PRR().
> 
> Perhaps; more generally I'd suspect it's a misuse of Keep.  In 
> general, once a particular markup has been processed, its pattern no
> longer appears in the string for reprocessing.
> 
>> Hmm... I've been thinking about a different approach: introduce a 
>> $RedoMarkupRule that, if set, will cause MarkupToHTML to repeat the
>>  current rule. The 'restore' rule could then set that variable so
>> it will be automatically repeated until no keep tokens are left.
> 
> What would be a good application of this?

1) It would do "what one expects" if Keep tokens are hidden within kept 
texts (the restore is repeated until there are no tokens left in the 
expanded text, which is exactly what we'd want).
2) For parsing nested stuff (see below for more details).
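
A minimal sketch of point 1, written in Python only so that it runs on its 
own (PmWiki itself is PHP). The names keep, restore_once and restore_all and 
the "\x93" token delimiter are assumptions made for illustration, not 
PmWiki's actual implementation. It shows that one restore pass leaves a 
token behind when a kept text itself contains a token, while repeating the 
restore until nothing is left gives the expected result:

```python
import re
from itertools import count

KEEP = "\x93"                       # hypothetical delimiter for keep tokens
TOKEN_RE = re.compile(KEEP + r"(\d+)" + KEEP)
_kept = {}                          # token number -> hidden text
_numbers = count(1)

def keep(text):
    """Hide text from later rules and return a placeholder token."""
    n = next(_numbers)
    _kept[n] = text
    return f"{KEEP}{n}{KEEP}"

def restore_once(text):
    """One pass of the restore rule: expand each token visible right now."""
    return TOKEN_RE.sub(lambda m: _kept[int(m.group(1))], text)

def restore_all(text):
    """Model of the proposed redo: repeat the restore rule while tokens remain."""
    while TOKEN_RE.search(text):
        text = restore_once(text)
    return text

inner = keep("<b>raw</b>")          # a kept text ...
outer = keep(f"[{inner}]")          # ... hidden inside another kept text
page = f"start {outer} end"

once = restore_once(page)           # still contains the inner token
done = restore_all(page)            # fully expanded
```

The loop assumes that no kept text contains its own token; otherwise it 
would not terminate.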

My intuition also says that there might be additional uses, though I 
can't tell which (but I have learnt to trust my intuition).

On re-thinking, I even think it might be a good idea to have a 
$RedoMarkupFromRule variable that tells markup processing to restart at 
a given rule - sometimes you have several interacting rules, and the 
last of them may want to reactivate the first rule. (OTOH I'm generally 
wary about groups of collaborating markups - things can get somewhat 
uncontrollable if some other markup places itself in-between such a 
group of rules.)
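
The restart idea can be sketched as a small rule loop. This is a Python 
model under stated assumptions, not MarkupToHTML itself: markup_to_html, 
the three toy rules and the VARS table are all hypothetical, and the 
restart request is modelled as a return value rather than as a global 
$RedoMarkupFromRule variable. The last rule asks for a restart at the first 
one as long as unexpanded {{...}} text remains:

```python
import re

def markup_to_html(text, rules, max_restarts=100):
    """Apply (name, rule) pairs in order; a rule returns (text, redo_from),
    where redo_from is None or the name of the rule to restart at."""
    names = [name for name, _ in rules]
    i = restarts = 0
    while i < len(rules):
        _, rule = rules[i]
        text, redo_from = rule(text)
        if redo_from is not None and restarts < max_restarts:
            restarts += 1
            i = names.index(redo_from)   # jump back to the named rule
        else:
            i += 1                       # carry on with the next rule
    return text

VARS = {"greeting": "*hello* {{name}}", "name": "world"}

def expand(text):
    # Single pass: text inserted by a replacement is not rescanned here.
    return re.sub(r"\{\{(\w+)\}\}", lambda m: VARS[m.group(1)], text), None

def bold(text):
    return re.sub(r"\*(.+?)\*", r"<b>\1</b>", text), None

def recheck(text):
    # The last rule reactivates the first one if its markup reappeared.
    return text, ("expand" if "{{" in text else None)

rules = [("expand", expand), ("bold", bold), ("recheck", recheck)]
html = markup_to_html("{{greeting}}!", rules)
```

The max_restarts guard is there because a restart request that never goes 
away would otherwise loop forever.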

(It would be best if all the rules were integrated into a single one, so 
that there is no "order" in which markups are processed - but I don't 
think that's an option for PmWiki. Maybe for PmWiki 3.0, but definitely 
not for the short term. Besides, I fiddled with the idea and found it 
would run into limitations of the PCRE engine - it cannot handle more 
than 99 capturing parentheses. IOW it isn't even implementable before 
the PCRE libraries in PHP are upgraded to handle an arbitrary number of 
captures, and any older versions are several years obsolete. Ah well...)

>> Such a variable would also help with parsing nested constructs of
>> all kinds: let the rule recognise just the innermost construct,
>> process it, and replace it with a keep token, then repeat, until
>> all nesting levels have been processed.
> 
> Unfortunately, such a procedure would also mean that any markups 
> within the nesting constructs are not processed, since they've been 
> hidden from further markup processing (which is what Keep() does --
> it hides things from markups).
> 
> I suppose one could restore all of the Keep items immediately after 
> processing the nested constructs,

That's exactly what I think should be done. Plus each nestable construct 
should be assigned an ID, which gets attached to each element of the 
construct at restoration time. This would make it easy for other markup 
to identify which closing parenthesis belongs to a given opening one 
(with "parenthesis", I mean anything that opens and closes a nestable 
construct).

Here's an example of how to nest (:if:)...(:then:)...(:else:)...(:ifend:) 
that way:
The first rule is "structural". It just says
   /\(:if:\)(.*?)\(:then:\)(.*?)\(:else:\)(.*?)\(:ifend:\)/ie
(double the backslashes to make it a PHP string...)
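Note the use of ungreedy matching, which is meant to make sure it captures 
only the innermost if-then-else construct. (Strictly speaking, ungreedy 
matching alone isn't enough, since the match still starts at the leftmost 
(:if:); the captured parts must also be kept from containing another 
(:if:), e.g. with a negative lookahead.)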
The replacement text looks like this (assuming there's a function NextId 
that increases the ID counter and then returns its parameter, and a 
global variable $Id that contains an "id token" which includes that ID 
counter):
   Keep(
     NextId(
       "$Id(:if:)\$1$Id(:then:)\$2$Id(:else:)\$3$Id(:ifend:)"
     ), 'block'
   )
i.e. the delimiters of each nestable construct are tagged with that $Id.

Keep() makes sure that the innermost if-then-else construct won't be 
recognised again, so the now-innermost if-then-else will be recognised. 
Repeat the markup rule until no if-then-else constructs remain.

The next rule after the if-then-else one should re-expand the 'block' 
tokens. (It should also clear out each token after expansion, to keep 
the memory footprint manageable. Otherwise, each such keep/restore 
process would increase memory usage by the number of characters enclosed 
in the outermost block.)

After that, a third rule can actually process the structure.
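
The first two rules can be sketched as follows, in Python so that the 
example runs standalone. Everything named here (keep, tag_and_keep, 
structural_rule, restore_rule, the "\x93" keep delimiter) is an assumption 
for illustration. One deliberate difference from the regex above: each 
captured part carries a negative lookahead so that it cannot run across 
another (:if:), because a lazy match on its own would start at the leftmost 
(:if:) rather than the innermost one. The restore rule also discards each 
token as it expands it:

```python
import re
from itertools import count

KEEP, IDMARK = "\x93", "\x92"        # hypothetical delimiters (the mail writes \222)
PART = r"((?:(?!\(:if:\)).)*?)"      # lazy text that may not cross another (:if:)
STRUCTURAL = re.compile(
    r"\(:if:\)" + PART + r"\(:then:\)" + PART +
    r"\(:else:\)" + PART + r"\(:ifend:\)", re.I | re.S)
TOKEN_RE = re.compile(KEEP + r"(\d+)" + KEEP)
kept = {}                            # token number -> hidden text
keep_numbers, ids = count(1), count(1)

def keep(text):
    n = next(keep_numbers)
    kept[n] = text
    return f"{KEEP}{n}{KEEP}"

def tag_and_keep(m):
    """Tag every delimiter of one construct with a fresh id, then hide it."""
    t = f"{IDMARK} {next(ids)} {IDMARK}"
    cond, yes, no = m.groups()
    return keep(f"{t}(:if:){cond}{t}(:then:){yes}{t}(:else:){no}{t}(:ifend:)")

def structural_rule(text):
    """Repeat until no untagged if-then-else construct is visible any more."""
    changed = 1
    while changed:
        text, changed = STRUCTURAL.subn(tag_and_keep, text)
    return text

def restore_rule(text):
    """Re-expand the tokens, dropping each stored text once it is used."""
    while TOKEN_RE.search(text):
        text = TOKEN_RE.sub(lambda m: kept.pop(int(m.group(1))), text)
    return text

source = "(:if:)A(:then:)(:if:)B(:then:)C(:else:)D(:ifend:)(:else:)E(:ifend:)"
hidden = structural_rule(source)     # the whole nest collapses into one token
tagged = restore_rule(hidden)        # every delimiter now carries its id
```

The inner construct receives id 1 and the outer one id 2, which is the 
inside-out order in which they are recognised.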

> but this is really not what Keep() was intended to do in the general
> case.

The quality of any code is easily measured by the ease with which it can 
be subjected to creative abuse ;-)

Actually another way to parse nested constructs that doesn't need Keep() 
just occurred to me: just make sure that the Id is inserted in a way 
that makes the construct invisible to the parser. I.e. instead of replacing
   (:if:)
with
   $Id(:if:)
replace it with
   (:${Id}if:)
and it won't be captured again.

After that, the construct will look something like
   (:\222 4711 \222if:)...
   (:\222 4711 \222then:)...
   (:\222 4711 \222else:)...
   (:\222 4711 \222ifend:)
which won't be parseable by the original ("structural") rule.


A second ("semantic") rule can then easily recognise a construct, by 
using this regex (split over several lines to increase readability):
   /\(:(\222 \d+ \222)if:\)(.*?)
    \(:\1then:\)(.*?)
    \(:\1else:\)(.*?)
    \(:\1ifend:\)/ie
(again, omitting the doubled backslashes that would be needed to write 
the regex in a PHP string).
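
A sketch of this Keep()-free variant, again in Python so it runs on its 
own. The names tag and structural_rule are hypothetical, and the structural 
pattern adds a negative lookahead (an assumption on top of the regex given 
earlier) so that a match cannot span another untagged (:if:). Once a 
construct is tagged, its delimiters no longer read (:if:), (:then:) and so 
on, so the next pass sees the enclosing construct as the innermost one. The 
semantic pattern then uses the \1 backreference to pair up delimiters that 
carry the same id:

```python
import re
from itertools import count

MARK = "\x92"                         # the \222 character from the mail
PART = r"((?:(?!\(:if:\)).)*?)"       # lazy text that may not cross another (:if:)
STRUCTURAL = re.compile(
    r"\(:if:\)" + PART + r"\(:then:\)" + PART +
    r"\(:else:\)" + PART + r"\(:ifend:\)", re.I | re.S)
SEMANTIC = re.compile(
    r"\(:(" + MARK + r" \d+ " + MARK + r")if:\)(.*?)" +
    r"\(:\1then:\)(.*?)" +
    r"\(:\1else:\)(.*?)" +
    r"\(:\1ifend:\)", re.I | re.S)
ids = count(1)

def tag(m):
    """Put the id inside each delimiter so the structural rule skips it."""
    t = f"{MARK} {next(ids)} {MARK}"
    cond, yes, no = m.groups()
    return f"(:{t}if:){cond}(:{t}then:){yes}(:{t}else:){no}(:{t}ifend:)"

def structural_rule(text):
    """Tag innermost constructs first, repeating until none are left."""
    changed = 1
    while changed:
        text, changed = STRUCTURAL.subn(tag, text)
    return text

source = "(:if:)A(:then:)(:if:)B(:then:)C(:else:)D(:ifend:)(:else:)E(:ifend:)"
tagged = structural_rule(source)
outer = SEMANTIC.search(tagged)       # leftmost match is the outer construct
```

Here outer.group(2), outer.group(3) and outer.group(4) are the condition, 
the then-part and the else-part of the outer construct; the then-part still 
contains the complete inner construct with its own id.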


Q.: Why is the split into "structural" and "semantic" rules important?

A.: Because nested structures must be recognised inside-out, but 
executed outside-in.

Assume we have
   (:if:) database exists (:then:)
     (:if:) we have the right database password (:then:)
       (:mysql table ...:)
     (:ifend:)
   (:else:)
     Some nice error message
   (:ifend:)
Now if we process the inner (:if:) as soon as it's recognised, that will 
happen first, before the check for database existence. If the database 
happens not to exist, we will get all sorts of nasty PHP error messages 
instead of "Some nice error message".
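
The outside-in execution can be sketched like this. It is a Python model 
with hypothetical names (tagged_if, evaluate, check, FACTS), and it starts 
from text whose delimiters already carry ids, built by hand here to keep 
the example short. Unlike the example above, the inner construct is given 
an (:else:) part so that it fits the if-then-else pattern. The conditions 
are recorded as they are checked, to show that the inner one is never 
evaluated when the outer one fails:

```python
import re

MARK = "\x92"                         # the \222 character from the mail
SEMANTIC = re.compile(
    r"\(:(" + MARK + r" \d+ " + MARK + r")if:\)(.*?)" +
    r"\(:\1then:\)(.*?)" +
    r"\(:\1else:\)(.*?)" +
    r"\(:\1ifend:\)", re.S)

def tagged_if(n, cond, yes, no):
    """Build one construct whose delimiters all carry id n."""
    t = f"{MARK} {n} {MARK}"
    return f"(:{t}if:){cond}(:{t}then:){yes}(:{t}else:){no}(:{t}ifend:)"

def evaluate(text, check):
    """Resolve constructs outside-in: only the chosen branch is descended into."""
    def choose(m):
        branch = m.group(3) if check(m.group(2).strip()) else m.group(4)
        return evaluate(branch, check)
    return SEMANTIC.sub(choose, text)

checked = []                          # conditions in the order they were tested
FACTS = {"database exists": False, "password ok": True}

def check(cond):
    checked.append(cond)
    return FACTS.get(cond, False)

inner = tagged_if(1, "password ok", "table output", "bad password")
outer = tagged_if(2, "database exists", inner, "Some nice error message")
result = evaluate(outer, check)
```

Because each match spans a whole construct up to the (:ifend:) with the 
same id, the leftmost match is always an outermost construct, and the 
discarded branch is dropped without its conditions ever being looked at.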

>> (If the nesting construct may span several lines, this would have
>> to happen before line splitting. ... BTW why does PmWiki split the
>> text into lines? Efficiency reasons, or other considerations?)
> 
> Two reasons:  First, the line-by-line model is the mental model that 
> most authors tend to understand when processing text; it makes sense 
> to keep that particular model.

That's close to the model that regexes work under anyway: by default . 
does not match the end-of-line character, and with the multi-line option 
^ and $ match at the start and end of each line rather than of the string.

OTOH it *is* a bit safer that way. Having more options makes it easier 
to misapply them.

It's just that this forces constructs that may span several lines into a 
*very* early stage of processing (whether these constructs are nestable 
or not) - unless you can map the nestable construct directly to a 
nestable HTML construct, which allows you to handle the beginning and 
end of the construct in separate rules and not worry about nesting (but 
that approach is not always possible, and sometimes not the best one).

> Secondly, it's a huge efficiency boost -- my experiments have shown
> me that the many pattern matches that get performed are *much* more
> efficient on many small strings than they are on one very large one.

I suspect it's the replacement step that is more efficient - replacing 
two characters with fifteen in a twenty-character string is bound to be 
more efficient than doing the same in a 20K text (there are advanced 
string packages that don't exhibit this behavior, but they have been 
largely unknown and unused).

Regards,
Jo


