[lug] parsing between two lists

Thu Mar 28 13:47:30 MST 2002

>>>>> "Rob" == Rob Riggs <Riggs> writes:

Rob> Unique in list A: diff -u listA listB | grep ^- | sed 's/^.//g'
Rob> Unique in list B: diff -u listA listB | grep ^+ | sed 's/^.//g'
Rob> Common to both: diff -u listA listB | grep "^ " | sed 's/^.//g'

hm.  now that i think about it, this version of "common to both"
probably won't work -- because "-u" only keeps 3 lines of context [by
default] in its output.  so, the two files:

   file1  file2
   a      a
   b      b
   c      c
   d      d
   e      e
   f
   g      g

the "-u" output should only have

     c
     d
     e
   - f
     g

thus dropping "a" and "b".  let me see if i got it right...
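
a quick check, as a sketch: the scratch directory from mktemp and the
variable name "context" are just for illustration, and this assumes a
diff that understands "-u" (GNU diff does).  it rebuilds the two files
above and runs the quoted "common to both" pipeline on them:

```shell
# recreate the two sample lists in a scratch directory (hypothetical location)
tmp=$(mktemp -d)
printf '%s\n' a b c d e f g > "$tmp/file1"
printf '%s\n' a b c d e g   > "$tmp/file2"

# keep only the context lines (leading space), then strip that marker column
context=$(diff -u "$tmp/file1" "$tmp/file2" | grep '^ ' | sed 's/^.//')
echo "$context"
```

only c, d, e and g come out, so "a" and "b" are indeed lost.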

i think the least intrusive fix is also reasonably efficient, thanks
to the merge capability of "sort -m" (the lists are already sorted):

   sort -m file1 file2 | uniq -d > common
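
here is a sketch of that idea on the two sample files, assuming
neither list repeats a line internally.  "uniq -d" prints one copy of
every line that occurs more than once in a row, and "uniq -u" prints
the lines that occur exactly once; the temp directory, the variable
names and the LC_ALL=C setting are only there for illustration:

```shell
# recreate the two sample lists (already sorted) in a scratch directory
tmp=$(mktemp -d)
printf '%s\n' a b c d e f g > "$tmp/file1"
printf '%s\n' a b c d e g   > "$tmp/file2"

# merge the sorted lists; a line present in both now appears twice in a row
common=$(LC_ALL=C sort -m "$tmp/file1" "$tmp/file2" | uniq -d)

# lines that appear only once belong to exactly one of the two lists
single=$(LC_ALL=C sort -m "$tmp/file1" "$tmp/file2" | uniq -u)

echo "common:"; echo "$common"
echo "in one list only:"; echo "$single"
```

this keeps "a" and "b" in the common set, and "f" is the only line
that shows up in just one list.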

if we wanted to stick with a perl script, we can take advantage of the
fact that they're already sorted.  instead of allocating a big hash
table for all the entries in one file, we could do something like this
instead:

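   # file1/file2 are the sorted inputs; the output names are placeholders
   open(FIRST,  '<', 'file1') or die "file1: $!";
   open(SECOND, '<', 'file2') or die "file2: $!";
   open(IN_FIRST,  '>', 'only1')  or die "only1: $!";
   open(IN_SECOND, '>', 'only2')  or die "only2: $!";
   open(IN_COMMON, '>', 'common') or die "common: $!";

   my $first = <FIRST>;
   my $second = <SECOND>;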

   while (defined($first) && defined($second))
   {
       if ($first lt $second)
       {
           print IN_FIRST $first;
           $first = <FIRST>;
       }
       elsif ($first gt $second)
       {
           print IN_SECOND $second;
           $second = <SECOND>;
       }
       else
       {
           print IN_COMMON $first;
           $first = <FIRST>;
           $second = <SECOND>;
       }
   }

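   # take care of stragglers: at most one list still has lines left,
   # starting with the one already read into $first or $second
   while (defined($first))
   {
       print IN_FIRST $first;
       $first = <FIRST>;
   }
   while (defined($second))
   {
       print IN_SECOND $second;
       $second = <SECOND>;
   }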
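
for comparison, comm(1) does the same three-way split on sorted input
straight from the shell.  a sketch with made-up lists (the file
contents, temp directory and variable names are all hypothetical),
chosen so that both sides have stragglers:

```shell
# two small sorted lists, invented for this example
tmp=$(mktemp -d)
printf '%s\n' a b d f   > "$tmp/first"
printf '%s\n' b c d g h > "$tmp/second"

# comm prints three columns: only in first, only in second, in both;
# -23, -13 and -12 each suppress two columns and leave the third
only_first=$(LC_ALL=C comm -23 "$tmp/first" "$tmp/second")
only_second=$(LC_ALL=C comm -13 "$tmp/first" "$tmp/second")
common=$(LC_ALL=C comm -12 "$tmp/first" "$tmp/second")

echo "only in first:";  echo "$only_first"
echo "only in second:"; echo "$only_second"
echo "common:";         echo "$common"
```

like the perl loop, this only works if both inputs are sorted the
same way.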

t.