[lug] finding text lines in a single file

Tue Apr 27 14:38:36 MDT 2004

Thanks!  I will have to store that away for future use as well.
The file could potentially be quite large. So optimization is a good thing.

Thanks again,
Carl.

-----Original Message-----
From: Tkil [mailto:tkil at scrye.com]
Sent: Tuesday, April 27, 2004 2:26 PM
To: Wagner, Carl
Cc: Boulder (Colorado) Linux Users Group -- General Mailing List
Subject: Re: [lug] finding text lines in a single file

>>>>> "Carl" == Carl Wagner <Wagner> writes:

Carl> That should do it.  Like I said, about 10 seconds.

If LogFile is really long, though, scanning through it multiple times
will be very slow.  A better technique is to build a single regex with
all the candidates to match, then scan the log file once.

Not sure how to do it in just shell, but in perl (at a sh-ish prompt):

perl -we 'my $re = join "|", @ARGV;
          while (<>) { print if /$re/o }' $( cat EntryFile ) < LogFile

In pure perl:

| #!/usr/bin/perl
| 
| use strict;
| use warnings;
| 
| require 5.006;
| 
| unless ( @ARGV == 2 )
| {
|     die "usage: $0 EntryFile LogFile\n";
| }
| 
| my ( $entry_file, $log_file ) = @ARGV;
| 
| my $entry_re = do
| {
|     open my $entry_fh, '<', $entry_file
|       or die "$0: opening $entry_file: $!";
|     my $re = join '|',
|                grep { /\S/ }                    # anything left?
|                  map { s/^\s+//; s/\s+$//; $_ } # remove whitespace
|                    <$entry_fh>;
|     qr/$re/
| };
| 
| open my $log_fh, '<', $log_file
|   or die "$0: opening $log_file: $!";
| while ( <$log_fh> )
| {
|     print if m/$entry_re/;
| }
| close $log_fh
|   or die "$0: closing $log_file: $!";
| 
| exit 0;

Oh, duh, you can build up a regex almost as easily in the shell:

| #!/bin/bash
| 
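| entry_file=$1
| log_file=$2
| 
| # join the whitespace-separated words of the entry file with "|"
| re=""
| sep=""
| for i in $( cat "$entry_file" )
| do
|   re="$re$sep$i"
|   sep="|"
| done
| 
| # scan the log once with the combined extended regex
| exec grep -E -e "$re" -- "$log_file"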

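Note that both of these solutions are likely to explode if EntryFile
contains regex special characters, since each entry is used as a
pattern verbatim.  In the shell case, even whitespace inside an entry
is enough to cause spurious matches: the unquoted $( cat ... ) splits
the entry into separate words, and each word becomes its own
alternative.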
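
One way around both problems, as a rough sketch rather than a drop-in
replacement: have grep treat the entries as fixed strings (-F) and
read them one per line from a file (-f), so nothing is interpreted as
a regex and whitespace inside an entry is preserved.  The function
name find_entries and the temporary pattern file are made up for
illustration; they are not part of the scripts above.  Also note that
this matches literal substrings, so it is not equivalent if the
entries are meant to be regexes.

```shell
#!/bin/bash

# find_entries ENTRY_FILE LOG_FILE   (hypothetical helper name)
# Print every line of LOG_FILE that contains at least one line of
# ENTRY_FILE as a literal substring, scanning LOG_FILE only once.
find_entries() {
  local entry_file=$1 log_file=$2 patterns status

  patterns=$(mktemp) || return 2

  # Strip leading/trailing whitespace and drop blank lines; an empty
  # pattern would otherwise match every line of the log.
  sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e '/^$/d' \
      "$entry_file" > "$patterns"

  # -F: patterns are fixed strings, not regexes
  # -f: read one pattern per line from the given file
  grep -F -f "$patterns" -- "$log_file"
  status=$?

  rm -f "$patterns"
  return "$status"
}
```

After sourcing the function, it would be called with the same
arguments as the scripts above, e.g.
"find_entries carl1-entries.txt carl1-log.txt".  The return status is
grep's own, so 1 means no log line matched.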

Here's what I tested against.

| $ perl -lwe 'for ( 1 .. 100 ) {
|                   printf "%03d 0x%04x\n", ( 100+rand(100) ) x 2
|              }' > carl1-log.txt
|
| $ cat carl1-log.txt 
| 182 0x00b6
| 126 0x007e
| 128 0x0080
| [...]
| 184 0x00b8
| 140 0x008c
| 127 0x007f
| 165 0x00a5
| 146 0x0092
|
| $ cat carl1-entries.txt 
|    123
|    124
|    125
|    126
|
| $ ./carl1.plx carl1-entries.txt carl1-log.txt
| 126 0x007e
| 125 0x007d
|
| $ ./carl1.sh carl1-entries.txt carl1-log.txt
| 126 0x007e
| 125 0x007d

t.