[lug] glibc regexp bug?

Tkil tkil at scrye.com
Tue Jun 7 21:38:18 MDT 2005


>>>>> "DS" == D Stimits <stimits at comcast.net> writes:

DS> I'm trying to verify if the regular expression is valid or not...
DS> regcomp() always fails on this expression: "*". 

Traditionally, regular expression ("RE") engines have taken the
attitude that, if a special character shows up where it cannot act in
its special role, it should be viewed as a literal.

So under traditional RE engines, "*" by itself should match the
literal asterisk character.  It turns out that (under FC3 at least)
'grep' and 'perl' disagree:

| $ perl -lwe 'if ( "*" =~ m/*/ ) { print "traditional!" }'
| Quantifier follows nothing in regex; marked by <-- HERE in m/* <-- HERE / at -e line 1.
| $ echo '*' | grep '*'
| *
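
For a third data point, and as a rough answer to the "is this
expression valid?" question, here is a minimal sketch in Python.  It
uses Python's 're' module rather than glibc's regcomp(), so it only
shows what that one engine thinks; the helper name is_valid_regex is
made up for illustration.

```python
import re

def is_valid_regex(pattern):
    """Return (True, None) if `pattern` compiles, else (False, message)."""
    try:
        re.compile(pattern)
    except re.error as exc:
        # The engine refused the pattern; pass its complaint along.
        return False, str(exc)
    return True, None

# A bare quantifier: Python sides with perl here and rejects it.
print(is_valid_regex("*"))
# An escaped asterisk compiles fine.
print(is_valid_regex(r"\*"))
```

The same try-to-compile-and-inspect-the-error approach should carry
over to regcomp(), with the caveat that which patterns get rejected
depends on the engine and the flags you pass.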

DS> Just a plain old wildcard that should stand for "any number of any
DS> character". 

As others have pointed out, you're confusing true regular expressions
(of varying degrees of complexity, as supported by grep, egrep, awk,
perl, etc) with "globbing", as originally supported by unix shells
(csh, sh, bash, etc).

Not that I've had this conversation before or anything:

   http://groups-beta.google.com/group/comp.lang.perl.misc/msg/68a814cd0e65cc3e?dmode=source&hl=en

From which I quote:

     GLOB        Perl RE
     ----        -------
     *           .*
     ?           .
     .           \.
     [a-z]       [a-z]

     file.name   ^file\.name$
     *.ext       \.ext$
     file.*      ^file\.
     *.*         \.
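
The table above can be turned into a small translator.  This is only
a sketch, in Python, under some assumptions: glob_to_regex is a
hypothetical helper, it handles just "*", "?" and simple "[...]"
classes, and it ignores things like "[!a-z]" negation and shell
quoting.  It always anchors both ends, so "*.ext" comes out as
"^.*\.ext$" rather than the shorter "\.ext$" in the table; the two
match the same strings when used with an unanchored search.

```python
import re

def glob_to_regex(glob):
    """Translate a simple glob into an anchored regex string."""
    out = []
    i = 0
    while i < len(glob):
        ch = glob[i]
        if ch == "*":
            out.append(".*")            # any run of characters
        elif ch == "?":
            out.append(".")             # exactly one character
        elif ch == "[":
            end = glob.find("]", i + 1)
            if end == -1:
                out.append(re.escape(ch))    # unmatched '[' is literal
            else:
                out.append(glob[i:end + 1])  # copy the class verbatim
                i = end
        else:
            out.append(re.escape(ch))   # '.' becomes '\.', etc.
        i += 1
    return "^" + "".join(out) + "$"

for g in ("file.name", "*.ext", "file.*", "*.*"):
    print(g, "->", glob_to_regex(g))
```

Python's standard library also ships fnmatch.translate() for this
job, which is probably the better choice outside of a demonstration.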

DS> Can anyone tell me if "*" is technically not a valid regular
DS> expression? Or is this a bug in glibc on fedora?

Technically, it's invalid; * is a quantifier which must follow some
other valid regular expression.  Traditionally, as described above,
some regular expression tools have interpreted this as "match a single
literal asterisk" instead of erroring out, but the proper way to write
that in most tools is "\*".

To amplify that last bit: you need to present the sequence BACKSLASH
ASTERISK to the regex engine.  Perl only needs one backslash, but if
you're using a language (such as C, Java, etc.) which accepts regexes
only in double-quoted strings, your code will actually look like
"\\*", because backslashes are special in double-quoted strings and
"\*" is (so far as I know) not valid C.
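
The doubling is easy to see in any language with backslash escapes in
its string literals.  Here is a small sketch in Python (not C, so
treat it as an analogy): the doubled form produces a two-character
string, and Python's raw-string syntax gives the same thing without
the doubling.

```python
import re

escaped = "\\*"   # two backslashes in the source, one in the string
raw = r"\*"       # raw string: the backslash is kept as written

print(len(escaped))                        # length of what the engine sees
print(escaped == raw)                      # both spellings, same string?
print(re.escape("*") == raw)               # let the library do the escaping
print(bool(re.search(escaped, "2 * 3")))   # does it find the literal asterisk?
```

If the pattern comes from user input rather than from a literal in
your source, none of this doubling applies; the string already holds
exactly what the user typed.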

>>>>> "RM" == rm  <rm at fabula.de> writes:

RM> are you talking about a single '*' character? That's a
RM> metacharacter (modifier) that is only valid after a pattern.

I'm not sure I remember where I picked up the "if it's invalid, see if
we can use it literally" rule, but even FC3 'grep' still obeys it.  It
is a kindness of the tool though, as you're entirely correct in
pointing out that it's not a valid regexp in that form.

>>>>> "DS" == D Stimits <stimits at comcast.net> writes:

DS> I guess if BLUG had to pick a new name someone might name it
DS> "Interesting Programming and Computer Trivia Group", I figure
DS> someone here will know (I've seen a huge number of regular
DS> expression and perl shortcut discussions here...probably as many
DS> as in the rest of the Internet combined) :P

I have NO IDEA what you're talking about.  ;->

t.


