[lug] Regex Help
Anthony Foiani
tkil at scrye.com
Sun Jul 10 01:02:06 MDT 2011
"George Sexton" <georges at mhsoftware.com> writes:
> I'm using an application that matches regular expressions in URLs.
>
> I'd like it to match
>
> /somepath/*
>
> But not
>
> /somethingelse/somepath/*
>
> I can write an expression to match /somepath/*. The problem is it's matching
> the second thing which I don't want.
>
> I don't get to write a lot of code.
>
> I don't know what the host name will be. It might be a fqdn, might be an IP
> Address.
>
> The input has the full URL syntax:
>
> Scheme:hostname/path/
Ok. A few corner cases:
1. Should we match on:
http://example.com/somepath/somepath/... ?
That is, are you actually testing for a lack of second level?
(My current guess is that you're writing ad-blocking regexes, and
want to match "http://example.com/ad/..." but not
"http://example.com/foo/ad/...")
2. Does your engine happen to parse the url into { scheme, host, path
} parts, as you indicate? (And possibly also query string)?
Anyway, the other suggestions are perfectly valid, and it sounds like
you have a solution that works. I'll just point out a few things that
came to mind as I saw them.
A. The most-often-used delimiter for regular expressions in actual
perl code is slashes:
if ( $string =~ m/regex/ ) { ... }
To the point where the "m" can be omitted from the above example,
and it will still work exactly the same way:
if ( $string =~ /regex/ ) { ... }
(There is a minor niggle if the regex is empty: a literal bare //
in this context reuses the last successfully matched pattern rather
than matching the empty string. If you know that the regex can never
be empty, then you don't need to worry about it.)
But if you know you're going to be matching slashes in the regex
itself, you can make your life easier by telling perl to use
something other than slashes as the regex delimiter. My most
common fallback is the exclamation point:
if ( $string =~ m!regex! ) { ... }
Compare:
if ( $path =~ m/^\/etc\/sysconfig\/networking\/devices\/if(\d+)-up$/ )
Against:
if ( $path =~ m!^/etc/sysconfig/networking/devices/if(\d+)-up$! )
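If you want to convince yourself that the delimiter is purely cosmetic, here is a minimal sketch that runs both spellings against the same string. The helper names and the sample path are made up for illustration; only the two patterns come from the comparison above.

```perl
use strict;
use warnings;

# Hypothetical helpers: the same pattern written with two different delimiters.
# Each returns the captured interface number, or undef if the path does not match.
sub ifnum_slashes {
    my ($path) = @_;
    return $path =~ m/^\/etc\/sysconfig\/networking\/devices\/if(\d+)-up$/ ? $1 : undef;
}

sub ifnum_bangs {
    my ($path) = @_;
    return $path =~ m!^/etc/sysconfig/networking/devices/if(\d+)-up$! ? $1 : undef;
}

# Made-up sample path, just to exercise both helpers.
my $path = '/etc/sysconfig/networking/devices/if42-up';
printf "slashes: %s, bangs: %s\n", ifnum_slashes($path), ifnum_bangs($path);
```

Both helpers should capture the same digits; the only difference is how many backslashes you have to read past.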
B. When you want to match content within matched delimiters, modern
regular expressions have a handy tool called "non-greedy" matching.
In this case, the suggested solution uses this phrase:
/[^/]*/
Which could be read in English as:
a. Match a single literal slash
b. Match zero or more non-slash characters
c. Match another single literal slash
It's made worse by the need to escape those slashes with
backslashes, because slashes are the default regex delimiter:
\/[^\/]*\/
What we're really trying to say, though, is:
a. Match a single literal slash
b. Match zero or more characters...
c. ... Until the next slash.
The first cut (which doesn't work!) would be:
/.*/
This fails if there are more slashes later in the string; e.g., if
our string is "/hi/there/mom/", then m!/.*/! would match the whole
string. This is called "greedy" matching: the ".*" construct wants
to match as many characters as it possibly can.
We fix this by using non-greedy matching, which is indicated by
following the greedy form with a question mark:
/.*?/
This pattern shows up a lot, especially in quick-and-dirty parsers
for XML/HTML (/"<.*?>"/ or m!<span.*?>.*?</span>!) and for
double-quoted strings (m!\".*?\"!). Those parsers are not 100%
correct but they work for 99% of the cases and are vastly
easier/faster to write and use than completely correct expressions.
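To make the greedy / non-greedy difference concrete, here is a small sketch using the "/hi/there/mom/" string from above. The matched_text helper is hypothetical, and the last two patterns are my own additions. They illustrate a caveat: non-greedy only means "prefer fewer characters", so .*? will still cross a slash if that is the only way for the rest of the pattern to match, while [^/]* never will.

```perl
use strict;
use warnings;

# Hypothetical helper: return the substring of $string matched by $re,
# or undef if there is no match.
sub matched_text {
    my ( $string, $re ) = @_;
    return $string =~ /($re)/ ? $1 : undef;
}

my $string = '/hi/there/mom/';

for my $case (
    [ 'greedy',        qr!/.*/! ],
    [ 'non-greedy',    qr!/.*?/! ],
    [ 'negated class', qr!/[^/]*/! ],
    # With more pattern after them, the two "short" forms stop agreeing:
    [ 'non-greedy + mom',    qr!^/.*?/mom/! ],
    [ 'negated class + mom', qr!^/[^/]*/mom/! ],
) {
    my ( $name, $re ) = @$case;
    printf "%-20s %s\n", $name, matched_text( $string, $re ) // '(no match)';
}
```

The greedy form grabs the whole string, while the plain non-greedy and negated-class forms both stop at "/hi/". In the last two rows the non-greedy version stretches across "/hi/there" to reach "/mom/", and the negated-class version refuses to match at all. That matters if you are relying on .*? to mean "exactly one path component".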
C. Anchoring the match to the beginning of a string, the end of a
string, or both.
Perl defaults to "unanchored" matching, while other engines
(boost.regex caught me with this) use different functions / calling
conventions to indicate whether anchoring should be assumed or not.
(In the latter case, the function "match" typically implies
front-and-back anchor, while "search" is unanchored.)
Your samples (and the solution you have working so far) are both
silent on this point. What do you want to happen when you hit a
URL that looks like this:
http://example.com/go?url=http://example.com/something/else/....
That's probably not entirely kosher w.r.t. the URL spec, but such
URLs do exist in the wild. If you are trying to match or block on
"http://example.com/something/else/...", then you need to specify
whether you want the match to happen at the start of the string, or
anywhere in the string.
(This is one reason why I asked you if your interface already split
out the path from the scheme and hostname; an "anchored" match
against the path seems the clearest description of what you're
trying to accomplish.)
It also addresses my first corner case, where the desired path is
actually present, but at the second level (underneath an
undesirable path). Anchoring is essential to separating these
cases.
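Here is a rough sketch of how anchoring changes the answer, using the /somepath/ prefix from the original question. The sub names and sample URLs are invented, the host pattern [^/?#]+ is my assumption about what counts as a host (it accepts both names and IP addresses), and I've only handled http/https.

```perl
use strict;
use warnings;

# Hypothetical check: unanchored, matches /somepath/ anywhere in the URL.
sub url_hit_unanchored { my ($url) = @_; return $url =~ m!/somepath/! ? 1 : 0 }

# Hypothetical check: anchored at the start of the URL; the host part may not
# contain '/', '?' or '#', so /somepath/ has to be the first path component.
sub url_hit_anchored {
    my ($url) = @_;
    return $url =~ m!^https?://[^/?#]+/somepath/!i ? 1 : 0;
}

# Hypothetical check against an already-split-out path.
sub path_hit_anchored { my ($path) = @_; return $path =~ m!^/somepath/! ? 1 : 0 }

for my $url (
    'http://example.com/somepath/page',
    'http://192.0.2.10/somethingelse/somepath/page',
    'http://example.com/go?url=http://example.com/somepath/page',
) {
    # Pull out the path (everything after the host, up to any '?' or '#').
    my ($path) = $url =~ m!^https?://[^/?#]+([^?#]*)!i;
    printf "%s\n  url unanchored=%d  url anchored=%d  path anchored=%d\n",
        $url, url_hit_unanchored($url), url_hit_anchored($url),
        path_hit_anchored($path);
}
```

The unanchored check says yes to all three URLs. The two anchored checks only say yes to the first one, which is what the original question seems to be after.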
Anyway. Without knowing exactly what you're trying to do, and what
language it's being done in, and what interface you have to the data,
we can only give guesses. (Pretty good ones, since you have a
solution that works! :)
In raw Perl, I'd try something like this:
    URL:
    foreach my $url ( @urls )
    {
        my ( $scheme, $rest ) = split /:/, $url, 2;
        if ( not $scheme or not defined $rest )
        {
            die "bad scheme: '$url'";
        }
        if ( $scheme =~ / ^ https? $ /ix )
        {
            unless ( $rest =~ m! ^ // ( .*? ) # host
                                 ( / .*? )? # path
                                 ( \? .* )? $ # query
                               !x )
            {
                die "bad http url: '$url'";
            }
            my ( $host, $path, $query ) = ( $1, $2, $3 );
            ### now do matching against $path or whatever.
            unless ( defined $path )
            {
                # no path to test
                print "pass: $url\n";
                next URL;
            }
            if ( $path =~ m! ^ /bad-prefix !x )
            {
                # path starts with bad prefix, flag it
                print "fail: $url\n";
                next URL;
            }
            # path is present and does not start with the bad prefix
            print "pass: $url\n";
        }
        else
        {
            die "unknown scheme '$scheme': '$url'";
        }
    }
Yes, this is overwrought, and is probably even slower than a single
regex. But would you rather read the above in 6 (or 12, or 48)
months, or something like:
    if ( $url =~ m!^https?://.*?/bad-prefix/! )
    {
        print "fail: $url";
    }
    else
    {
        print "pass: $url";
    }
Or maybe it's:
    if ( $url =~ m!^https?://.*?/good-prefix/! )
    {
        print "pass: $url";
    }
    else
    {
        print "fail: $url";
    }
And does the same filtering apply to ftp URLs? (Rare, but still out
there, especially for downloads etc.)
What if you start growing a collection of good or bad prefixes?
    my @schemes = qw( http https ftp sftp rsync git );
    my $schemes_re = join '|', map quotemeta( $_ ), @schemes;
    $schemes_re = qr/$schemes_re/; # unless qr//e ever got implemented...
    my @good_prefixes = qw( good1 good2 good3 good4 );
    my $good_prefixes_re = join '|', map quotemeta( $_ ), @good_prefixes;
    $good_prefixes_re = qr/$good_prefixes_re/;
    if ( $url =~ m!^$schemes_re://.*?/$good_prefixes_re/! )
    {
        print "pass: $url";
    }
    else
    {
        print "fail: $url";
    }
And so on. Me, I'd rather take the expression apart so I can see
where it's dying. (But that's just me, and perl has had a really
shiny "color debug regex" mode for ... prolly over a decade now, so
maybe I should just learn to use that.)
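By "take the expression apart" I mean something along these lines: a minimal sketch that checks one piece at a time and reports which piece failed. The explain_url name, the messages, and the character classes for scheme and host are all my own assumptions for illustration, not anything your application provides.

```perl
use strict;
use warnings;

# Hypothetical helper: check a URL piece by piece and report the first piece
# that fails, instead of a bare "no match" from one big regex.
sub explain_url {
    my ( $url, $prefix ) = @_;

    # Split off the scheme (everything before the first colon).
    my ( $scheme, $rest ) = $url =~ m!\A([a-z][a-z0-9+.-]*):(.*)!is
        or return 'no scheme';
    return "unsupported scheme '$scheme'" unless $scheme =~ /\Ahttps?\z/i;

    # Host runs up to the first '/', '?' or '#'; path runs up to '?' or '#'.
    my ( $host, $path ) = $rest =~ m!\A//([^/?#]*)([^?#]*)!
        or return 'no //host part';
    return 'empty host' unless length $host;
    return 'no path'    unless length $path;

    # Anchored test: the prefix must be the first path component.
    return "path does not start with /$prefix/"
        unless $path =~ m!\A/\Q$prefix\E/!;

    return 'ok';
}

for my $url (
    'http://example.com/somepath/page',
    'http://example.com/somethingelse/somepath/page',
    'ftp://example.com/somepath/',
    'http://example.com',
) {
    printf "%-50s %s\n", $url, explain_url( $url, 'somepath' );
}
```

It is more typing than the one-liner, but when a URL unexpectedly passes or fails, the returned string tells you which stage made the decision.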
Happy hacking,
t.
p.s. This isn't the first time I've gone down these roads...
http://lists.community.tummy.com/pipermail/lug/Week-of-Mon-20010820/013173.html