[lug] Another sed question

Tkil tkil at scrye.com
Sun Feb 5 04:01:37 MST 2006


>>>>> "Bill" == Bill Thoen <bthoen at gisnet.com> writes:

>> Any case with an E following only 2 digits can match (since the
>> third digit is consumed as the "something that's not E" character.)
>> ...

Bill> Why wouldn't it look at the next character *after* the last
Bill> digit and decide if that one is E or not? That's what I thought
Bill> it did.

You underestimate the tenacity of the regexp engine.  :) Backtracking
is what trips us here.

Consider the line "R12ES".  (Which I don't think was in your list of
examples, but as a bad case...)

If we use the regexp /^R[0-9]{1,3}([^E].*|)$/, we can try to follow
what the regex engine does to match it.  I'll use two lines for each
"state"; the first is the string being matched ("R12ES" in this
example), the second will have a caret indicating what character the
engine is currently looking at.

1. R12ES
  ^        start-of-line matches "^", good so far.

2. R12ES
   ^       matches "R", good.

3. R12ES
    ^      matches "[0-9]{1,3}" with one replication.

4. R12ES
     ^     matches "[0-9]{1,3}" with two replications.

5. R12ES
      ^    uh-oh.  doesn't match "[0-9]{1,3}", doesn't match
           "([^E].*|)$" either.  this fails.  pop off this match.

6. R12ES
     ^     Since (5) failed, we backtrack to (4).  We have the option
           of choosing whether the phrase "[0-9]{1,3}" matches 1, 2,
           or 3 times; since having it match 2 (the most *greedy*)
           didn't work, we try just one.  That means that "2" can now
           match the phrase "[^E]"

7. R12ES
      ^    This matches ".*", good.

8. R12ES
       ^   This matches ".*", good.

9. R12ES
        ^  This matches "$", finished, and we found a match.

And that's about the best I can do with just text.  The idea to take
away from this is that regular expression matching is a series of
assumptions made by the regex engine; the engine will continue making
different assumptions until there are no more options left, or until
it has successfully made a match.

(It's worth mentioning that the decision made in step 4 is really the
crux of whether something will match "greedily" or not; greedy is the
default, so given the choice between moving on to the next chunk of
regexp, or grabbing more input for the current chunk, regexps default
to grabbing more.  So the first thing it tries is to increase the rep
count from 1 to 2...)

The behavior you were expecting (that the digits would be "consumed"
and not "given back") is associated with "atomic capture" or
"possessive quantifiers" in sufficiently advanced RE engines.

In Perl, this should do the trick:

   perl -nlwe '/^(?>R\d{1,3})(?:[^E].*|)$/ && print' list

Note the use of "(?>...)" for atomic grouping.  It tells the perl RE
engine that, should it get to the end of that group, that it can only
accept or reject the matched text as a unit -- it can't "give back"
characters and backtrack into the bits of that match.  And it does
exactly what we want:

| $ cat list
| R21E
| R142E
| R12/SENW
| R222W
| R1E
| 
| $ # without atomic grouping:
| $ perl -nwe '/^R\d{1,3}(?:[^E].*|)$/ && print' list
| R21E
| R142E
| R12/SENW
| R222W
| 
| $ # with atomic grouping:
| $ perl -nwe '/^(?>R\d{1,3})(?:[^E].*|)$/ && print' list
| R12/SENW
| R222W

Java's standard regexp package has "possessive quantifiers" instead of
atomic grouping; instead of

   (?>R\d{1,3})

one would write:

   R\d{1,3}+

This use of "+" to indicate possessive/atomic matches echoes the use
of "?" for non-greedy matching:

   .*        match zero or more of whatever, greedily.
   .*?       match zero or more of whatever, miserly.
   .*+       match zero or more of whatever, as a unit.

Clear as mud?  Hope it's a bit better than that.  :)

t.



More information about the LUG mailing list