[lug] Another sed question

Sat Feb 4 00:33:50 MST 2006

>>>>> "Bill" == Bill Thoen <bthoen at gisnet.com> writes:

Bill> Given the following items in the file called 'list':

Bill> R21E
Bill> R142E
Bill> R12/SENW
Bill> R222W
Bill> R1E

Bill> I want to print out only the ones where the 'R' and following
Bill> number (which can be anything in the range of 0 to 999) is NOT
Bill> followed by an 'E'.

Is there any reason to stick with 'sed'?  This is a perfect use of
"grep" (or its friends 'fgrep' and 'egrep'...):

   egrep '^R[0-9]{1,3}' list | egrep -v '.+E'

Although it's not clear whether you want, e.g., "R12/SENW" printed; I
read your request as not wanting that (because it's a number from 0 to
999 followed by an "E", just not /immediately/ by an "E").
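
If the other reading is what you meant (the "E" must not come
/immediately/ after the number), a sketch along these lines should do
it.  The list_data and not_immediately_e helper names are mine, there
only to make the example self-contained, and "grep -E" is the same
thing as egrep:

```shell
# Sample data from the original question.
list_data() {
    printf '%s\n' R21E R142E R12/SENW R222W R1E
}

# Keep lines where "R" plus 1 to 3 digits is NOT immediately followed
# by "E".  Excluding digits in the bracket stops the regex from backing
# off to a shorter number (e.g. reading "R21E" as "R2" followed by "1").
not_immediately_e() {
    grep -E '^R[0-9]{1,3}([^0-9E]|$)'
}

list_data | not_immediately_e
```

With the sample list, that prints R12/SENW and R222W.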

More abstractly, the fact that negation is difficult to describe in
regular expressions is a deep problem.  Let me try to rephrase your
criterion a bit more clearly:

   Given a list of strings, print only those strings that start with
   "R", followed by 1 to 3 digits, and where the remaining string does
   not contain an "E".

If that's a correct paraphrase, then you could do it with a single
regex (only because the negation is of a single character, which /can/
be handled with complementation):

   egrep '^R[0-9]{1,3}[^E]*$' list

E.g.:

   | $ cat list
   | R21E
   | R142E
   | R12/SENW
   | R222W
   | R1E
   | $ egrep '^R[0-9]{1,3}' list | egrep -v '.+E'
   | R222W
   | $ egrep '^R[0-9]{1,3}[^E]*$' list
   | R222W
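
And since the question was about sed: the same single regex should work
there too.  A minimal sketch, assuming a POSIX sed (basic regexes, so
the braces need backslashes); list_data and sed_filter are names I made
up for the example:

```shell
# Sample data from the original question.
list_data() {
    printf '%s\n' R21E R142E R12/SENW R222W R1E
}

# -n turns off sed's default printing; the trailing "p" prints only the
# lines that match the address pattern.
sed_filter() {
    sed -n '/^R[0-9]\{1,3\}[^E]*$/p'
}

list_data | sed_filter
```

On the sample list this prints just R222W, matching the egrep runs.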

(The rest of this message is me jumping off to the deep end...)

On the other hand, had you said "... where the remaining string does
not contain the substring 'xy'", you'd be in trouble, because pure
regexps can't cleanly indicate that.

In those cases, I tend to head towards a mix of pure regex and
procedural; in the pipeline above, I use the "procedural" ability to
apply two different REs.  In Perl, I might write something like this:

   while (<>) {
       chomp;
       # $1 is the whole entry; $2 is whatever follows the digits.
       if ( /^(R\d{1,3}(.*))$/ ) {
           my ( $all, $rest ) = ( $1, $2 );
           if ( $rest !~ /E/ ) {
               print "$all\n";
           }
       }
   }
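
The same two-step idea handles the hypothetical 'xy' case in the shell.
This is only a sketch: the sample entries are invented, and without_xy
is a made-up helper name.  Because "R" plus digits can never contain
"xy", checking the whole line is the same as checking the remainder:

```shell
# Invented sample entries for the hypothetical "xy" criterion.
sample() {
    printf '%s\n' R12xy R12x R12yx R3axyb X12
}

# Step 1: keep entries that start with "R" and 1 to 3 digits.
# Step 2: drop any of those that contain the fixed string "xy".
without_xy() {
    grep -E '^R[0-9]{1,3}' | grep -v -F 'xy'
}

sample | without_xy
```

That prints R12x and R12yx; each regex stays trivial because the
negation lives in the pipeline rather than in the pattern.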

Aside: You can come up with abominations that say "one or more
instances of: zero or more of anything but 'x', or an 'x' followed by
anything other than a 'y'..."  Ugh.  And before you think that's too
outlandish, consider standard C comments: they're started by the
two-character sequence "/*" and closed by the matching "*/"; but
because the closing token is two characters, it is difficult to
describe in simple regexps.
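
For the curious, here is roughly what such an abomination looks like.
Treat it as a sketch: no_xy and c_comments are names I invented, the
comment pattern only handles comments that open and close on one line,
and "grep -o" is a GNU/BSD extension rather than POSIX:

```shell
# Lines that do NOT contain "xy": any mix of non-"x" characters and
# runs of "x" followed by something that is neither "x" nor "y", then
# an optional trailing run of "x".
no_xy() {
    grep -E '^([^x]|x+[^xy])*x*$'
}

# Same trick for one-line C comments: inside the comment, a run of "*"
# must not be followed by "/"; the final run of "*" plus "/" closes it.
c_comments() {
    grep -E -o '/\*([^*]|\*+[^*/])*\*+/'
}

printf '%s\n' abc xay yx axyb xxy | no_xy
printf '%s\n' 'a /* one */ b /* two **/ c' | c_comments
```

The first command prints abc, xay and yx; the second prints the two
comments, one per line.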

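Having said that, there are cheats for lots of this, and most regex
languages are no longer strictly "regular": backreferences and
non-greedy quantifiers are the two extensions that come immediately to
mind (strictly speaking, only backreferences go beyond regular
languages; non-greedy matching is a convenience).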

Even with that assistance, you run into problems when you try to write
a regex that would match C-style comments while honoring double-quoted
strings and backslashed escapes in that string.  E.g., this fragment
is a nice torture test:

   /* save this for later:
    *    char * comment_string = "/* ... */";
    */
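
To see what that fragment does to a comment pattern with no notion of
strings, here is a small sketch.  I've flattened the fragment onto one
line (my simplification, since grep matches line by line), and
naive_comment is a made-up name:

```shell
# A comment pattern that knows nothing about double-quoted strings.
naive_comment() {
    grep -E -o '/\*([^*]|\*+[^*/])*\*+/'
}

# The torture test, flattened onto a single line.
printf '%s\n' '/* save: char * s = "/* ... */"; */' | naive_comment
```

The match stops at the first "*/", the one inside the quoted string, so
the output is

   /* save: char * s = "/* ... */

and the trailing '"; */' is left dangling.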

Much more on this sort of fun is in _Mastering Regular Expressions_ by
Jeffrey Friedl; if you expect to be doing much text wrangling -- and
who doesn't? -- I can't recommend that book highly enough.

t.