[lug] Perl Question..

Thu Apr 17 19:37:05 MDT 2003

>>>>> "Joey" == Joseph McDonald <joem at uu.net> writes:

Joey> Well, that is a good point. A seperate process creates monthly
Joey> reporting files in a directory and it names them something like
Joey> this: '$month.$year.html'.

If you have *any* influence over this process, do yourself (and your
company) a favor by converting to the ISO-8601 date format:

   http://www.cl.cam.ac.uk/~mgk25/iso-time.html

In this case, using "YYYY-MM" (all numeric) would have made this a
non-problem, as that would sort properly even with a simple
lexigraphic sort.

Joey> I was initially working on a predefined hash, but I ended up
Joey> reading up on sort() and came up with this:

Various meditations.

Joey> my @files = <*.html>;
Joey> my @var = sort month_year_sort @files;
Joey>
Joey> for (@var) {
Joey>      chomp;
Joey>      my ($month, $year, $ext) = split /\./;
Joey>      print "<a href=\"$_\">$month -- $year</A><BR>\n";
Joey> }
Joey>
Joey> sub month_year_sort
Joey> {
Joey>    my $i = 1;
Joey>    my $status = 1;
Joey>
Joey>    my $int_a = int_month($a);
Joey>    my $int_b = int_month($b);
Joey>
Joey>    $a =~ /\w+\.(\d{4})\.html/;
Joey>    $a_year = $1;
Joey>    $b =~ /\w+\.(\d{4})\.html/;
Joey>    $b_year = $1;
Joey>
Joey>    return $a_year <=> $b_year || $int_a <=> $int_b;
Joey> }

Taking these two together, you're doing work multiple times: you're
splitting by month and year twice per file (at least); also, remember
that month_year_sort will be called N*log2(N) times, with all the work
associated.

1. As a first step, do the work of parsing and translating month
   numbers only once per file:

   | my @files = glob '*.html';
   | 
   | my ( %year, %month, %int_month );
   | foreach my $file ( @files )
   | {
   |     my ( $month, $year ) = split /\./, $file;
   |     $year{$file} = $year;
   |     $month{$file} = $month;
   |     $int_month{$file} = int_month $month;
   | }
   | 
   | my @sorted_files = sort { $year{$a} <=> $year{$b} ||
   |                           $int_month{$a} <=> $int_month{$b} } @files;
   | 
   | foreach my $file ( @sorted_files )
   | {
   |     print "<a href=\"$file\">$month{$file} - $year{$file}</a><br />\n";
   | }

   A nice trick for putting double quotes in a double-quoted string is
   to use Perl's "qq" operator (perldoc perlop).  With only two, it's
   not that big a deal, but when writing XML tags that use double
   quotes around everything, they're very handy.  The print statement
   becomes:

   |     print qq|<a href="$file">$month{$file} - $year{$file}</a><br />\n|;

   For longer chunks of text, you can also see similar improvements in
   readability by using here-docs.  Not worth it for one line at a
   time, tho.

2. Instead of using hashes to look up the values for each file, we can
   simply store those derived values "next to" the filename itself by
   using a list-of-lists

   | my @files = glob '*.html';
   | 
   | my @augmented_files;
   | foreach my $file ( @files )
   | {
   |     my ( $month, $year ) = split /\./, $file;
   |     push @augmented_files, [ $file, $year, $month, int_month($month) ];
   | }
   | 
   | my @sorted_files = sort { $a->[1] <=> $b->[1] ||
   |                           $a->[3] <=> $b->[3] } @augmented_files;
   | 
   | foreach my $file_aref ( @sorted_files )
   | {
   |     my ( $file, $year, $month, $int_month ) = @$file_aref;
   |     print qq|<a href="$file">$month - $year{</a><br />\n|;
   | }

3. Looking at the first "foreach" above, we see that we're taking a
   list in, and spitting a list out.  This is a common use for "map":

   | my @files = glob '*.html';
   | 
   | my @augmented_files = map
   | {
   |     my ( $month, $year ) = split /\./, $file;
   |     [ $file, $year, $month, int_month($month) ];
   | } @files;
   | 
   | my @sorted_files = sort { $a->[1] <=> $b->[1] ||
   |                           $a->[3] <=> $b->[3] } @augmented_files;
   | 
   | foreach my $file_aref ( @sorted_files )
   | {
   |     my ( $file, $year, $month, $int_month ) = @$file_aref;
   |     print qq|<a href="$file">$month - $year{</a><br />\n|;
   | }

4. Note that we have three stages that generate lists.  Since we
   process each member once, we can actually cascade them.

   | my @files =
   |   sort { $a->[1] <=> $b->[1] || $a->[3] <=> $b->[3] }
   |     map {
   |         my ( $month, $year ) = split /\./, $file;
   |         [ $file, $year, $month, int_month($month) ];
   |     } glob '*.html';
   | 
   | foreach my $file_aref ( @files )
   | {
   |     my ( $file, $year, $month, $int_month ) = @$file_aref;
   |     print qq|<a href="$file">$month - $year{</a><br />\n|;
   | }

   This idea -- find the expensive sort keys for each element, augment
   the original list, sort the augmented list, then do stuff with the
   sorted output -- is commonly called the "Schwartzian Transform".

5. The last stage is to notice that the final loop is really just
   going over the same elements again; unsurprisingly, we can use map
   for that, too:

   | print 
   |   map { qq|<a href="$_->[0]">$_->[2] - $_->[1]</a><br />\n| }
   |     sort { $a->[1] <=> $b->[1] || $a->[3] <=> $b->[3] }
   |       map {
   |           my ( $month, $year ) = split /\./, $file;
   |           [ $file, $year, $month, int_month($month) ];
   |       } glob '*.html';

6. Instead of using two stages, how about formatting into ISO-8601
   style (or close) so we can just use a string comparison?

   | print 
   |   map { qq|<a href="$_->[0]">$_->[2] - $_->[1]</a><br />\n| }
   |     sort { $a->[4] cmp $b->[4] }
   |       map {
   |           my ( $month, $year ) = split /\./, $file;
   |           my $key = sprintf "%04d-%02d", $year, int_month($month);
   |           [ $file, $year, $month, $key ];
   |       } glob '*.html';

7. Let's revisit the idea of keeping $year and $month around:

   | print 
   |   map { s:^((\w+)\.(\d+)\..*)$:<a href="$1">$1 - $2</a><br />\n:; $_ }
   |     map { $_->[0] }
   |       sort { $a->[1] cmp $b->[1] }
   |         map {
   |             my ( $month, $year ) = split /\./, $file;
   |             [ $file, sprintf "%04d-%02d", $year, int_month($month) ];
   |         } glob '*.html';

8. And, just to get rid of that two-line map phase, we use another map
   and a dirty trick.  (Since the file starts with the month name, and
   we only look at the front of the month passed into "int_month"...)

   | print 
   |   map { s:^((\w+)\.(\d+)\..*)$:<a href="$1">$2 - $3</a><br />\n:; $_ }
   |     map { $_->[0] }
   |       sort { $a->[1] cmp $b->[1] }
   |         map { [ $_->[0], sprintf "%04d-%02d", $_->[2], int_month($_) ] }
   |           map { [ $_, split /\./, $_ ] }
   |             glob '*.html';

9. The more polite solution is to move the complex key calculation off
   to the side:

   | sub file_sort_key ( $ )
   | {
   |     my ( $month, $year ) = split /\./, $_[0];
   |     return sprintf "%04d-%02d", $year, int_month($month);
   | }
   |
   | print 
   |   map { s:^((\w+)\.(\d+)\..*)$:<a href="$1">$2 - $3</a><br />\n: }
   |     map { $_->[0] }
   |       sort { $a->[1] cmp $b->[1] }
   |         map { [ $_, file_sort_key $_ ] }
   |           glob '*.html';

And looking at this routine:

Joey> sub int_month
Joey> {
Joey>    @months = qw ( Jan Feb Mar Apr May June Jul Aug Sep Oct Nov Dec );
Joey>    my $i = 1;
Joey>    foreach $month ( @months ) {
Joey>         return $i if  $_[0] =~ /^$month/;
Joey>         $i++;
Joey>    }
Joey> }

1. You should probably name your parameter, instead of using "$_[0]".
   Also, @months and $month should both have limited scope (so use a
   "my").

   | sub int_month
   | {
   |     my ( $in_month ) = @_;
   |     my @months = qw( Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec );
   |     my $i = 1;
   |     foreach my $month ( @months ) {
   |          return $i if $in_month =~ /^$month/;
   |          $i++;
   |     }
   | }

2. This loop requires perl to rebuild that regex every time.  If you
   know you're only matching the first three letters of each month,
   use that to your advantage.  (Notice that I changed "June" to "Jun"
   for consistency.)

   | sub int_month
   | {
   |     my ( $in_month ) = @_;
   |     my $in_abbrev = substr( $in_month, 0, 3 );
   |     my @months = qw( Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec );
   |     my $i = 1;
   |     foreach my $month ( @months ) {
   |          return $i if $in_abbrev eq $month;
   |          $i++;
   |     }
   | }

3. Instead of looping explicitly, we can have perl "loop" internally
   by using a single string with all the month names in it:

   | sub int_month
   | {
   |     my ( $in_month ) = @_;
   |     my $in_abbrev = substr( $in_month, 0, 3 );
   |     my $all_months = 'Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec';
   |     my $i = index( $all_months, $in_abbrev );
   |     if ( $i > -1 )
   |     {
   |          return 1 + $i / 4;
   |     }
   |     else
   |     {
   |          return;
   |     }
   | }

4. And a few final tweaks: addition of prototype; addition of scope
   for "static" months string (only init it once, but not visible
   anywhere outside this scope); only bother grabbing $in_abbrev,
   instead of naming the parameter directly for just one other use;
   add spaces to front of $MONTHS so that we can just add one then
   divide by 4 (returning 0 for "not found"); use of ucfirst to make
   sure they match the case we use on the way in.

   | {
   |     my $MONTHS = '   Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec';
   | 
   |     sub int_month ( $ )
   |     {
   |         my $in_abbrev = ucfirst substr $_[0], 0, 3;
   |         return ( index( $MONTHS, $in_abbrev ) + 1 ) / 4;
   |     }
   | }

   It might actually be faster to switch back to a hash at this point:

   | {
   |     my %MONTHS;
   |     @MONTHS{ qw( Jan Feb Mar Apr May Jun
   |                  Jul Aug Sep Oct Nov Dec ) } = 1 .. 12;
   |        
   |     sub int_month ( $ )
   |     {
   |         my $in_abbrev = ucfirst substr $_[0], 0, 3;
   |         return $MONTHS{$in_abbrev} || 0;
   |     }
   | }

   I don't know which would be better; benchmarking is the way to go
   at this point.

Putting all that together, here's my test program:

   | #!/usr/bin/perl -w
   | 
   | use strict;
   | 
   | {
   |     my %MONTHS;
   |     @MONTHS{ qw( Jan Feb Mar Apr May Jun
   |                  Jul Aug Sep Oct Nov Dec ) } = 1 .. 12;
   | 
   |     sub int_month ( $ )
   |     {
   |         my $in_abbrev = ucfirst substr $_[0], 0, 3;
   |         return $MONTHS{$in_abbrev} || 0;
   |     }
   | }
   | 
   | sub file_sort_key ( $ )
   | {
   |     my ( $month, $year ) = split /\./, $_[0];
   |     return sprintf "%04d-%02d", $year, int_month($month);
   | }
   | 
   | print
   |   map { s:^((\w+)\.(\d+)\..*)$:<a href="$1">$2 - $3</a><br />\n:; $_ }
   |     map { $_->[0] }
   |       sort { $a->[1] cmp $b->[1] }
   |         map { [ $_, file_sort_key $_ ] }
   |           # glob '*.html';
   |           grep { $_ } map { s/\s+\z//; $_ } <DATA>;
   | 
   | __DATA__
   | August.1998.html
   | December.900.html
   | December.1999.html
   | January.1997.html
   | November.2003.html

Which produces the desired output:

   | $ ./joey1.plx
   | <a href="December.900.html">December - 900</a><br />
   | <a href="January.1997.html">January - 1997</a><br />
   | <a href="August.1998.html">August - 1998</a><br />
   | <a href="December.1999.html">December - 1999</a><br />
   | <a href="November.2003.html">November - 2003</a><br />