[lug] Regex Help

Sun Jul 10 01:02:06 MDT 2011

"George Sexton" <georges at mhsoftware.com> writes:

> I'm using an application that matches regular expressions in URLs.
>
> I'd like it to match
>
> /somepath/*
>
> But not
>
> /somethingelse/somepath/*
>
> I can write an expression to match /somepath/*. The problem is it's matching
> the second thing which I don't want.
>
> I don't get to write a lot of code. 
>
> I don't know what the host name will be. It might be a fqdn, might be an IP
> Address.
>
> The input has the full URL syntax:
>
> Scheme:hostname/path/

Ok.  A few corner cases:

1. Should we match on:

      http://example.com/somepath/somepath/... ?

   That is, are you actually testing for a lack of second level?

   (My current guess is that you're writing ad-blocking regexes, and
   want to match "http://example.com/ad/..." but not
   "http://example.com/foo/ad/...")

2. Does your engine happen to parse the url into { scheme, host, path
   } parts, as you indicate?  (And possibly also query string)?

Anyway, the other suggestions are perfectly valid, and it sounds like
you have a solution that works.  I'll just point out a few things that
came to mind as I saw them.

A. The most-often-used delimiter for regular expressions in actual
   perl code is slashes:

      if ( $string =~ m/regex/ ) { ... }

   To the point where the "m" can be omitted from the above example,
   and it will still work exactly the same way:

      if ( $string =~ /regex/ ) { ... }

   (There is a minor niggle if the regex is empty: a literal bare //
   in this context reuses the last successfully matched pattern
   instead of matching the empty string.  If you know that the regex
   can never be empty, then you don't need to worry about it.)

   But if you know you're going to be matching slashes in the regex
   itself, you can make your life easier by telling perl to use
   something other than slashes as the regex delimiter.  My most
   common fallback is the exclamation point:

      if ( $string =~ m!regex! ) { ... }

   Compare:

      if ( $path =~ m/^\/etc\/sysconfig\/networking\/devices\/if(\d+)-up$/ )

   Against:

      if ( $path =~ m!^/etc/sysconfig/networking/devices/if(\d+)-up$! )

B. When you want to match content within matched delimiters, modern
   regular expressions have a handy tool called "non-greedy" matching.

   In this case, the suggested solution uses this phrase:

      /[^/]*/

   Which could be read in English as:

      a. Match a single literal slash
      b. Match zero or more non-slash characters
      c. Match another single literal slash

   It's made worse by the need to escape those slashes with
   backslashes, because they're the default regex delimiter:

      \/[^\/]*\/

   What we're really trying to say, though, is:

      a. Match a single literal slash
      b. Match zero or more characters...
      c. ... Until the next slash.

   The first cut (which doesn't work!) would be:

      /.*/

   This fails if there are more slashes later in the string; e.g., if
   our string is "/hi/there/mom/", then m!/.*/! would match the whole
   string.  This is called "greedy" matching: the ".*" construct wants
   to match as many characters as it possibly can.

   We fix this by using non-greedy matching, which is indicated by
   following the greedy form with a question mark:

      /.*?/

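   This pattern shows up a lot, especially in quick-and-dirty parsers
   for XML/HTML (/<.*?>/ or m!<span.*?>.*?</span>!) and for
   double-quoted strings (m!\".*?\"!).  Those parsers are not 100%
   correct but they work for 99% of the cases and are vastly
   easier/faster to write and use than completely correct expressions.

   One caveat: non-greedy only means "prefer the shortest match".  If
   more pattern follows the closing slash, the engine will still let
   ".*?" run past a slash when that is the only way for the rest of
   the pattern to match, whereas "[^/]*" never can.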
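
To make the difference concrete, here is a small sketch that runs the
greedy, non-greedy, and negated-class forms against the sample string
above, and then against a URL where more pattern follows the wildcard.
The URL and the variable names are made up for illustration (the "/ad/"
path is borrowed from my ad-blocking guess earlier).

```perl
use strict;
use warnings;

my $string = "/hi/there/mom/";

# Greedy: .* grabs as much as it can, so the match runs to the last slash.
my ( $greedy ) = $string =~ m!(/.*/)!;

# Non-greedy: .*? gives up as soon as the closing slash can match.
my ( $lazy ) = $string =~ m!(/.*?/)!;

# Negated class: can never cross a slash, whatever follows it.
my ( $class ) = $string =~ m!(/[^/]*/)!;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
print "class:  $class\n";

# Hypothetical URL with the interesting path at the second level.
my $url = "http://example.com/foo/ad/banner.gif";

# With more pattern after the wildcard, .*? stretches across "/foo"
# because that is the only way for "/ad/" to match ...
my $lazy_hit = ( $url =~ m!^https?://.*?/ad/! ) ? 1 : 0;

# ... while [^/]* is stuck inside the host name, so this one fails.
my $class_hit = ( $url =~ m!^https?://[^/]*/ad/! ) ? 1 : 0;

print "lazy_hit=$lazy_hit class_hit=$class_hit\n";
```

So for a bare "slash to next slash" match the two spellings agree, but
when you need "first path component only", the negated class is the
safer one.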

C. Anchoring the match to the beginning of a string, the end of a
   string, or both.

   Perl defaults to "unanchored" matching, while other engines
   (boost.regex caught me with this) use different functions / calling
   conventions to indicate whether anchoring should be assumed or not.
   (In the latter case, the function "match" typically implies
   front-and-back anchor, while "search" is unanchored.)

   Your samples (and the solution you have working so far) are both
   silent on this point.  What do you want to happen when you hit a
   URL that looks like this:

      http://example.com/go?url=http://example.com/something/else/....

   That's probably not entirely kosher w.r.t. the URL spec, but such
   URLs do exist in the wild.  If you are trying to match or block on
   "http://example.com/something/else/...", then you need to specify
   whether you want the match to happen at the start of the string, or
   anywhere in the string.

   (This is one reason why I asked you if your interface already split
   out the path from the scheme and hostname; an "anchored" match
   against the path seems the clearest description of what you're
   trying to accomplish.)

   It also addresses my first corner case, where the desired path is
   actually present, but at the second level (underneath an
   undesirable path).  Anchoring is essential to separating these
   cases.
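
A quick sketch of what anchoring buys you, assuming the path has
already been split off from the scheme and host.  The two paths are the
ones from the original question with a made-up "index.html" tacked on.

```perl
use strict;
use warnings;

# Hypothetical paths, already separated from scheme and host.
my @paths = (
    "/somepath/index.html",
    "/somethingelse/somepath/index.html",
);

my ( %unanchored, %anchored );
foreach my $path ( @paths )
{
    # Unanchored: "/somepath/" may appear anywhere in the path.
    $unanchored{$path} = ( $path =~ m!/somepath/! ) ? 1 : 0;

    # Anchored: "/somepath/" must be the very first path component.
    $anchored{$path} = ( $path =~ m!^/somepath/! ) ? 1 : 0;

    print "$path: unanchored=$unanchored{$path} anchored=$anchored{$path}\n";
}
```

The unanchored pattern hits both paths; only the anchored one tells
them apart.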

Anyway.  Without knowing exactly what you're trying to do, and what
language it's being done in, and what interface you have to the data,
we can only give guesses.  (Pretty good ones, since you have a
solution that works!  :)

In raw Perl, I'd try something like this:

   URL:
   foreach my $url ( @urls )
   {

       my ( $scheme, $rest ) = split /:/, $url, 2;
       if ( not $scheme or not defined $rest )
       {
           die "bad scheme: '$url'";
       }

       if ( $scheme =~ / ^ https? $ /ix )
       {
           unless ( $rest =~ m! ^ // (    .*? )    # host
                                     ( /  .*? )?   # path
                                     ( \? .*  )? $ # query
                              !x )
           {
               die "bad http url: '$url'";
           }

           my ( $host, $path, $query ) = ( $1, $2, $3 );

           ### now do matching against $path or whatever.

           unless ( defined $path )
           {
               # no path to test
               print "pass: $url\n";
               next URL;
           }

           if ( $path =~ m! ^ /bad-prefix !x )
           {
               # path starts with bad prefix, flag it
               print "fail: $url\n";
               next URL;
           }

           # path is present and does not start with the bad prefix
           print "pass: $url\n";

       }
       else
       {
           die "unknown scheme '$scheme': '$url'";
       }

   }

Yes, this is overwrought, and is probably even slower than a single
regex.  But would you rather read the above in 6 (or 12, or 48)
months, or something like:

   if ( $url =~ m!^https?://[^/]*/bad-prefix/! )
   {
       print "fail: $url\n";
   }
   else
   {
       print "pass: $url\n";
   }

Or maybe it's:

   if ( $url =~ m!^https?://[^/]*/good-prefix/! )
   {
       print "pass: $url\n";
   }
   else
   {
       print "fail: $url\n";
   }

And does the same filtering apply to ftp URLs?  (Rare, but still out
there, especially for downloads etc.)

What if you start growing a collection of good or bad prefixes?

   my @schemes = qw( http https ftp sftp rsync git );
   my $schemes_re = join '|', map quotemeta( $_ ), @schemes;
   $schemes_re = qr/$schemes_re/; # unless qr//e ever got implemented...

   my @good_prefixes = qw( good1 good2 good3 good4 );
   my $good_prefixes_re = join '|', map quotemeta( $_ ), @good_prefixes;
   $good_prefixes_re = qr/$good_prefixes_re/;

   if ( $url =~ m!^$schemes_re://[^/]*/$good_prefixes_re/! )
   {
       print "pass: $url\n";
   }
   else
   {
       print "fail: $url\n";
   }
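
One detail in the above that is easy to miss: the qr// step is doing
real work, because it wraps the alternation in a group.  Interpolate
the plain joined string instead and the "^" and the "://" each bind to
only one alternative.  A minimal sketch, with a shortened scheme list
and a made-up test string:

```perl
use strict;
use warnings;

my @schemes = qw( http https ftp );

# Plain string: "http|https|ftp", with no grouping around it.
my $alt_string = join '|', map quotemeta( $_ ), @schemes;

# Compiled: qr// wraps the alternation in a non-capturing group.
my $alt_re = qr/$alt_string/;

# Hypothetical input that is clearly not a URL.
my $not_a_url = "xhttps-not-a-url";

# Expands to ^http|https|ftp:// so a bare "https" anywhere is a match.
my $string_hit = ( $not_a_url =~ m!^${alt_string}://! ) ? 1 : 0;

# Expands to ^(?:http|https|ftp):// so the anchor and "://" apply to all.
my $re_hit = ( $not_a_url =~ m!^${alt_re}://! ) ? 1 : 0;

# Sanity check against something that really does start with a scheme.
my $real_hit = ( "ftp://example.com/good1/" =~ m!^${alt_re}://! ) ? 1 : 0;

print "string_hit=$string_hit re_hit=$re_hit real_hit=$real_hit\n";
```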

And so on.  Me, I'd rather take the expression apart so I can see
where it's dying.  (But that's just me, and perl has had a really
shiny "color debug regex" mode for ... prolly over a decade now, so
maybe I should just learn to use that.)

Happy hacking,
t.

p.s. This isn't the first time I've gone down these roads...

     http://lists.community.tummy.com/pipermail/lug/Week-of-Mon-20010820/013173.html