[lug] Perl Question..
Tkil
tkil at scrye.com
Thu Apr 17 19:37:05 MDT 2003
>>>>> "Joey" == Joseph McDonald <joem at uu.net> writes:
Joey> Well, that is a good point. A seperate process creates monthly
Joey> reporting files in a directory and it names them something like
Joey> this: '$month.$year.html'.
If you have *any* influence over this process, do yourself (and your
company) a favor by converting to the ISO-8601 date format:
http://www.cl.cam.ac.uk/~mgk25/iso-time.html
In this case, using "YYYY-MM" (all numeric) would have made this a
non-problem, as that would sort properly even with a simple
lexigraphic sort.
Joey> I was initially working on a predefined hash, but I ended up
Joey> reading up on sort() and came up with this:
Various meditations.
Joey> my @files = <*.html>;
Joey> my @var = sort month_year_sort @files;
Joey>
Joey> for (@var) {
Joey> chomp;
Joey> my ($month, $year, $ext) = split /\./;
Joey> print "<a href=\"$_\">$month -- $year</A><BR>\n";
Joey> }
Joey>
Joey> sub month_year_sort
Joey> {
Joey> my $i = 1;
Joey> my $status = 1;
Joey>
Joey> my $int_a = int_month($a);
Joey> my $int_b = int_month($b);
Joey>
Joey> $a =~ /\w+\.(\d{4})\.html/;
Joey> $a_year = $1;
Joey> $b =~ /\w+\.(\d{4})\.html/;
Joey> $b_year = $1;
Joey>
Joey> return $a_year <=> $b_year || $int_a <=> $int_b;
Joey> }
Taking these two together, you're doing work multiple times: you're
splitting by month and year twice per file (at least); also, remember
that month_year_sort will be called N*log2(N) times, with all the work
associated.
1. As a first step, do the work of parsing and translating month
numbers only once per file:
| my @files = glob '*.html';
|
| my ( %year, %month, %int_month );
| foreach my $file ( @files )
| {
| my ( $month, $year ) = split /\./, $file;
| $year{$file} = $year;
| $month{$file} = $month;
| $int_month{$file} = int_month $month;
| }
|
| my @sorted_files = sort { $year{$a} <=> $year{$b} ||
| $int_month{$a} <=> $int_month{$b} } @files;
|
| foreach my $file ( @sorted_files )
| {
| print "<a href=\"$file\">$month{$file} - $year{$file}</a><br />\n";
| }
A nice trick for putting double quotes in a double-quoted string is
to use Perl's "qq" operator (perldoc perlop). With only two, it's
not that big a deal, but when writing XML tags that use double
quotes around everything, they're very handy. The print statement
becomes:
| print qq|<a href="$file">$month{$file} - $year{$file}</a><br />\n|;
For longer chunks of text, you can also see similar improvements in
readability by using here-docs. Not worth it for one line at a
time, tho.
2. Instead of using hashes to look up the values for each file, we can
simply store those derived values "next to" the filename itself by
using a list-of-lists
| my @files = glob '*.html';
|
| my @augmented_files;
| foreach my $file ( @files )
| {
| my ( $month, $year ) = split /\./, $file;
| push @augmented_files, [ $file, $year, $month, int_month($month) ];
| }
|
| my @sorted_files = sort { $a->[1] <=> $b->[1] ||
| $a->[3] <=> $b->[3] } @augmented_files;
|
| foreach my $file_aref ( @sorted_files )
| {
| my ( $file, $year, $month, $int_month ) = @$file_aref;
| print qq|<a href="$file">$month - $year{</a><br />\n|;
| }
3. Looking at the first "foreach" above, we see that we're taking a
list in, and spitting a list out. This is a common use for "map":
| my @files = glob '*.html';
|
| my @augmented_files = map
| {
| my ( $month, $year ) = split /\./, $file;
| [ $file, $year, $month, int_month($month) ];
| } @files;
|
| my @sorted_files = sort { $a->[1] <=> $b->[1] ||
| $a->[3] <=> $b->[3] } @augmented_files;
|
| foreach my $file_aref ( @sorted_files )
| {
| my ( $file, $year, $month, $int_month ) = @$file_aref;
| print qq|<a href="$file">$month - $year{</a><br />\n|;
| }
4. Note that we have three stages that generate lists. Since we
process each member once, we can actually cascade them.
| my @files =
| sort { $a->[1] <=> $b->[1] || $a->[3] <=> $b->[3] }
| map {
| my ( $month, $year ) = split /\./, $file;
| [ $file, $year, $month, int_month($month) ];
| } glob '*.html';
|
| foreach my $file_aref ( @files )
| {
| my ( $file, $year, $month, $int_month ) = @$file_aref;
| print qq|<a href="$file">$month - $year{</a><br />\n|;
| }
This idea -- find the expensive sort keys for each element, augment
the original list, sort the augmented list, then do stuff with the
sorted output -- is commonly called the "Schwartzian Transform".
5. The last stage is to notice that the final loop is really just
going over the same elements again; unsurprisingly, we can use map
for that, too:
| print
| map { qq|<a href="$_->[0]">$_->[2] - $_->[1]</a><br />\n| }
| sort { $a->[1] <=> $b->[1] || $a->[3] <=> $b->[3] }
| map {
| my ( $month, $year ) = split /\./, $file;
| [ $file, $year, $month, int_month($month) ];
| } glob '*.html';
6. Instead of using two stages, how about formatting into ISO-8601
style (or close) so we can just use a string comparison?
| print
| map { qq|<a href="$_->[0]">$_->[2] - $_->[1]</a><br />\n| }
| sort { $a->[4] cmp $b->[4] }
| map {
| my ( $month, $year ) = split /\./, $file;
| my $key = sprintf "%04d-%02d", $year, int_month($month);
| [ $file, $year, $month, $key ];
| } glob '*.html';
7. Let's revisit the idea of keeping $year and $month around:
| print
| map { s:^((\w+)\.(\d+)\..*)$:<a href="$1">$1 - $2</a><br />\n:; $_ }
| map { $_->[0] }
| sort { $a->[1] cmp $b->[1] }
| map {
| my ( $month, $year ) = split /\./, $file;
| [ $file, sprintf "%04d-%02d", $year, int_month($month) ];
| } glob '*.html';
8. And, just to get rid of that two-line map phase, we use another map
and a dirty trick. (Since the file starts with the month name, and
we only look at the front of the month passed into "int_month"...)
| print
| map { s:^((\w+)\.(\d+)\..*)$:<a href="$1">$2 - $3</a><br />\n:; $_ }
| map { $_->[0] }
| sort { $a->[1] cmp $b->[1] }
| map { [ $_->[0], sprintf "%04d-%02d", $_->[2], int_month($_) ] }
| map { [ $_, split /\./, $_ ] }
| glob '*.html';
9. The more polite solution is to move the complex key calculation off
to the side:
| sub file_sort_key ( $ )
| {
| my ( $month, $year ) = split /\./, $_[0];
| return sprintf "%04d-%02d", $year, int_month($month);
| }
|
| print
| map { s:^((\w+)\.(\d+)\..*)$:<a href="$1">$2 - $3</a><br />\n: }
| map { $_->[0] }
| sort { $a->[1] cmp $b->[1] }
| map { [ $_, file_sort_key $_ ] }
| glob '*.html';
And looking at this routine:
Joey> sub int_month
Joey> {
Joey> @months = qw ( Jan Feb Mar Apr May June Jul Aug Sep Oct Nov Dec );
Joey> my $i = 1;
Joey> foreach $month ( @months ) {
Joey> return $i if $_[0] =~ /^$month/;
Joey> $i++;
Joey> }
Joey> }
1. You should probably name your parameter, instead of using "$_[0]".
Also, @months and $month should both have limited scope (so use a
"my").
| sub int_month
| {
| my ( $in_month ) = @_;
| my @months = qw( Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec );
| my $i = 1;
| foreach my $month ( @months ) {
| return $i if $in_month =~ /^$month/;
| $i++;
| }
| }
2. This loop requires perl to rebuild that regex every time. If you
know you're only matching the first three letters of each month,
use that to your advantage. (Notice that I changed "June" to "Jun"
for consistency.)
| sub int_month
| {
| my ( $in_month ) = @_;
| my $in_abbrev = substr( $in_month, 0, 3 );
| my @months = qw( Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec );
| my $i = 1;
| foreach my $month ( @months ) {
| return $i if $in_abbrev eq $month;
| $i++;
| }
| }
3. Instead of looping explicitly, we can have perl "loop" internally
by using a single string with all the month names in it:
| sub int_month
| {
| my ( $in_month ) = @_;
| my $in_abbrev = substr( $in_month, 0, 3 );
| my $all_months = 'Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec';
| my $i = index( $all_months, $in_abbrev );
| if ( $i > -1 )
| {
| return 1 + $i / 4;
| }
| else
| {
| return;
| }
| }
4. And a few final tweaks: addition of prototype; addition of scope
for "static" months string (only init it once, but not visible
anywhere outside this scope); only bother grabbing $in_abbrev,
instead of naming the parameter directly for just one other use;
add spaces to front of $MONTHS so that we can just add one then
divide by 4 (returning 0 for "not found"); use of ucfirst to make
sure they match the case we use on the way in.
| {
| my $MONTHS = ' Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec';
|
| sub int_month ( $ )
| {
| my $in_abbrev = ucfirst substr $_[0], 0, 3;
| return ( index( $MONTHS, $in_abbrev ) + 1 ) / 4;
| }
| }
It might actually be faster to switch back to a hash at this point:
| {
| my %MONTHS;
| @MONTHS{ qw( Jan Feb Mar Apr May Jun
| Jul Aug Sep Oct Nov Dec ) } = 1 .. 12;
|
| sub int_month ( $ )
| {
| my $in_abbrev = ucfirst substr $_[0], 0, 3;
| return $MONTHS{$in_abbrev} || 0;
| }
| }
I don't know which would be better; benchmarking is the way to go
at this point.
Putting all that together, here's my test program:
| #!/usr/bin/perl -w
|
| use strict;
|
| {
| my %MONTHS;
| @MONTHS{ qw( Jan Feb Mar Apr May Jun
| Jul Aug Sep Oct Nov Dec ) } = 1 .. 12;
|
| sub int_month ( $ )
| {
| my $in_abbrev = ucfirst substr $_[0], 0, 3;
| return $MONTHS{$in_abbrev} || 0;
| }
| }
|
| sub file_sort_key ( $ )
| {
| my ( $month, $year ) = split /\./, $_[0];
| return sprintf "%04d-%02d", $year, int_month($month);
| }
|
| print
| map { s:^((\w+)\.(\d+)\..*)$:<a href="$1">$2 - $3</a><br />\n:; $_ }
| map { $_->[0] }
| sort { $a->[1] cmp $b->[1] }
| map { [ $_, file_sort_key $_ ] }
| # glob '*.html';
| grep { $_ } map { s/\s+\z//; $_ } <DATA>;
|
| __DATA__
| August.1998.html
| December.900.html
| December.1999.html
| January.1997.html
| November.2003.html
Which produces the desired output:
| $ ./joey1.plx
| <a href="December.900.html">December - 900</a><br />
| <a href="January.1997.html">January - 1997</a><br />
| <a href="August.1998.html">August - 1998</a><br />
| <a href="December.1999.html">December - 1999</a><br />
| <a href="November.2003.html">November - 2003</a><br />
More information about the LUG
mailing list