[lug] Another sed question

Tkil tkil at scrye.com
Sat Feb 4 13:09:00 MST 2006


>>>>> "Bill" == Bill Thoen <bthoen at gisnet.com> writes:

Bill> No, I only want the records where E does not immediately follow
Bill> the number, where the number is 1 to 3 digits long and
Bill> immediately follows an 'R'.

Ok, that's pretty easy.  Although as I ask below, do you want records
that are just R followed by 1-3 digits, then nothing else?

Bill> I'm beginning to believe that regex negation is not as
Bill> straight-forward as one might think.

Ha.  I'm pretty sure that's precisely the point I tried to make in my
first reply.  :)

Bill> Or maybe it's a problem especially when preceeded by variable
Bill> expressions like [0-9][0-9]* or [0-9]\{1,3\}. For example "R1E"
Bill> will be correctly not-matched with '/R[0-9][0-9]*[^E]/' and
Bill> "R25E" will be correctly not-matched with
Bill> '/R[0-9][0-9][0-9]*[^E]/' but not with '/R[0-9][0-9]*[^E]/' as
Bill> it should be. But simply turning the logic around with
Bill> '/R[0-9][0-9]*[A-DF-Z]/' works perfectly.

Except it does not cover the case where there are no suffixes, since
"end of string" does not match /[A-DF-Z]/.  Which might or might not
ever happen; I don't know your exact input data.
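
To make the missing-suffix case concrete, here is a small sketch using
Python's re module rather than sed (my choice for illustration, and the
sample records are made up).  It only checks the "turned around"
pattern against a record with a non-E suffix, one with an E suffix, and
one with no suffix at all:

```python
import re

# Bill's "turned around" pattern, left unanchored as in his message.
suffix_not_e = re.compile(r"R[0-9][0-9]*[A-DF-Z]")

# Hypothetical sample records, not taken from the real data.
records = ["R25W", "R25E", "R25"]

# Map each record to whether the pattern finds a match in it.
results = {rec: bool(suffix_not_e.search(rec)) for rec in records}

for rec, matched in results.items():
    print(rec, matched)
```

The bare "R25" is rejected along with "R25E", which is only a problem
if suffix-less records exist and should be kept.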

(That sort of range is also a disaster waiting to happen in the
face of internationalization.  Again, might never happen, but *in
general* when you're reviewing regexps used for text processing, that
sort of range should raise a red flag.  The same arguably applies
to "0-9", but the incidence of digits outside the ASCII range is
orders of magnitude lower than that of valid letters outside that
range.)
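
A quick illustration of the internationalization concern, again as a
Python sketch with invented records: an accented capital letter is a
perfectly valid "letter that is not E", but it falls outside the
explicit A-D/F-Z range and so is silently dropped.

```python
import re

# Same ASCII-only range as in the quoted pattern.
ascii_range = re.compile(r"R[0-9][0-9]*[A-DF-Z]")

# "\u00d1" is LATIN CAPITAL LETTER N WITH TILDE; both records are made up.
plain = "R25N"
accented = "R25\u00d1"

plain_matches = bool(ascii_range.search(plain))
accented_matches = bool(ascii_range.search(accented))

print(plain, plain_matches)
print(accented, accented_matches)
```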

So I thought I could just use a complemented character class, but I
got bitten by backtracking.  If I try the simple:

   /^R[0-9]{1,3}([^E].*|)$/

Any case with an E following 2 or 3 digits can still match (since the
regex backs off one digit, and that last digit is consumed as the
"something that's not E" character.)  Fancier RE dialects have
"possessive quantifiers" or "atomic grouping" that would fix that...
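
The backtracking can be seen directly by looking at what the group
captured.  This is a sketch in Python's re module, not sed.  The second
pattern is my own addition: it imitates an atomic group with the
lookahead-plus-backreference trick, on the assumption that the engine
does not backtrack into a lookahead once it has succeeded (which is how
Python's re behaves).

```python
import re

# The simple pattern from above.
naive = re.compile(r"^R[0-9]{1,3}([^E].*|)$")

m = naive.match("R25E")
# {1,3} gives back the "5", so the group captures "5E".
naive_group = m.group(1) if m else None
print(repr(naive_group))

# With a single digit there is nothing to give back, so this fails.
one_digit = naive.match("R1E")
print(one_digit)

# Atomic-style variant: the lookahead grabs the digits once, and \1
# then consumes exactly those digits with no chance to shorten the run.
atomic_like = re.compile(r"^R(?=([0-9]{1,3}))\1([^E].*|)$")

atomic_results = {
    rec: bool(atomic_like.match(rec)) for rec in ["R25E", "R25W", "R25"]
}
print(atomic_results)
```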

Excluding digits from the range gives me this, which works well:

   /^R[0-9]{1,3}([^E0-9].*|)$/
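
As a sanity check, here is the same pattern run over a handful of
made-up records in Python (the sample list is an assumption about what
the data might look like, not the real input):

```python
import re

# Final pattern: the character after the digits may be neither E nor a digit.
final = re.compile(r"^R[0-9]{1,3}([^E0-9].*|)$")

# Hypothetical records covering the interesting cases.
samples = [
    "R1", "R25", "R123",        # no suffix at all
    "R1E", "R25E", "R123E",     # E right after the number
    "R25W", "R7 north",         # some other suffix
    "R1234",                    # four digits
]

# Keep only the records the pattern accepts.
wanted = [s for s in samples if final.match(s)]
print(wanted)
```

Note that excluding digits also rejects "R1234", which fits the "1 to 3
digits" requirement.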

Again, I can't recommend Friedl's book highly enough.

t.


