[lug] awk question

Tue Jan 28 14:51:55 MST 2003

>>>>> "Tkil" == Tkil <tkil at scrye.com> writes:

Tkil> Not in awk, but there's a standard module that comes with perl
Tkil> to do this:

Tkil>   perl -MText::ParseWords=quotewords \
Tkil>     -lnwe 'print join "|", quotewords ",", 0, $_' foo.csv

>>>>> "James" == James Harris <James_Harris at maxtor.com> writes:

James> Ack!, darn -- reason 4,230,304 why I need to learn perl.  ;)
James> Thanks for the information.  Unfortunately I just haven't had
James> the cycles to commit to learning perl yet.  I've tinkered with
James> it for some primitive regex search/replace type simple stuff
James> and had no problems, but in this case I need to do a decent
James> amount of more complex functions, so I'm hesitant to try
James> pulling it out of my you_know_where for this particular case.

James> I'll file this in the KB mail folder and use it as my
James> motivation.  :)

I also highly recommend Jeffrey Friedl's book _Mastering Regular
Expressions_.  He covers umpteen different languages and how they
handle things.  This particular case -- delimited, potentially quoted
data -- is a common one, and he discusses it in depth.

Overall, though, I hope you do learn perl (for your own sake!).  It
was actually the limitations of awk that caused Larry Wall to create
perl, so you're following in some impressive footsteps.  Also, there
is an "awk to perl" translator, a2p, that might get you started -- but
as with any automatic coding tool, it's not really designed for human-
readable output.

One other way to solve this is to build a simple C program that
implements the very easy state machine you need to parse this.
Sometimes it's just easier to handle problems character-by-character
than to try to second-guess the regular expression engine.  About 20
minutes of work (5 on the actual algorithm, 15 making it pretty and
yanking out the fancy escape feature) gave me this:

| /* csv-to-tabs.c
|  *
|  * Filter that converts a simple-minded comma separated values (csv)
|  * file to tab-delimited format, which is often easier to deal with.
|  *
|  * Usage: csv-to-tabs < input-csv.txt > output-tabbed.txt
|  *
|  * Written 2003-01-28 by Tkil <tkil at scrye.com>
|  *
|  * Placed into public domain.
|  *
|  */
| 
| #include <stdio.h>
| 
| int main( int argc, char * argv [] )
| {
|     /* describe input stream */
|     char in_sep   = ',';
|     char in_quote = '"';
| 
|     /* field separator to use in output stream */
|     char out_sep  = '\t';
| 
|     /* are we currently quoting anything? */
|     int quoting = 0;
| 
|     /* actual character being considered */
|     int c;
| 
|     while ( ( c = getchar() ) != EOF )
|     {
|         if ( c == in_quote )
|         {
|             /* toggle quoting state */
|             quoting = !quoting;
|         }
|         else if ( c == in_sep && !quoting )
|         {
|             /* new field */
|             putchar( out_sep );
|         }
|         else
|         {
|             /* plain old character */
|             putchar( c );
|         }
|     }
| 
|     return 0;
| }

While it's not a very fancy engine, and it has the same failure as
some of the other solutions (that is, if there's a raw tab in the
input data, it won't do anything about it), it will handle the rules
you have in mind.  You could easily have it use \x01 as the output
separator instead of a tab, which (depending on the next tool in the
chain) might work better.  *shrug*  At least tabs are understood by
'cut(1)' and friends.
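
If the raw-tab problem matters, one possible workaround is to
substitute any data character that collides with the output separator.
The following is only a sketch under my own assumptions: the function
name csv_line_to_sep, its parameters, and the choice of a replacement
character are all hypothetical, and it works on one line held in memory
rather than on stdin so that the separator can be passed in.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of the filter above): convert one CSV
 * line to out_sep-delimited text.  Quote characters are dropped, and
 * any data character equal to out_sep is replaced by subst so it
 * cannot be mistaken for a field boundary.  The output is never longer
 * than the input, so out_size must be at least strlen(in) + 1.
 * Returns the number of characters written, not counting the NUL. */
size_t csv_line_to_sep( const char * in, char * out, size_t out_size,
                        char out_sep, char subst )
{
    size_t i;
    size_t n = 0;
    int quoting = 0;

    assert( in != NULL && out != NULL && out_size > strlen( in ) );

    for ( i = 0; in[i] != '\0'; i++ )
    {
        char c = in[i];

        if ( c == '"' )
        {
            /* toggle quoting state */
            quoting = !quoting;
        }
        else if ( c == ',' && !quoting )
        {
            /* new field */
            out[n++] = out_sep;
        }
        else if ( c == out_sep )
        {
            /* data collides with the output separator */
            out[n++] = subst;
        }
        else
        {
            /* plain old character */
            out[n++] = c;
        }
    }

    out[n] = '\0';
    return n;
}
```

Calling it with '\t' as out_sep and ' ' as subst turns a raw tab in the
data into a space; passing '\x01' as out_sep gives the control-character
variant instead.  Whether silently rewriting data like this is
acceptable depends on the next tool in the chain.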

The first version I wrote was actually more complex, as it allowed for
quotes and commas to be backslash-escaped:

| #include <stdio.h>
| 
| int main( int argc, char * argv [] )
| {
|     char in_sep   = ',';
|     char in_quote = '"';
|     char in_esc   = '\\';
| 
|     char out_sep  = '\t';
| 
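|     int quoting = 0;    /* inside a quoted field? */
|     int esc_next = 0;   /* was the previous character the escape? */
| 
|     int c;
|     while ( ( c = getchar() ) != EOF )
|     {
|         if ( esc_next )
|         {
|             /* escaped character: emit it literally */
|             esc_next = 0;
|             putchar( c );
|         }
|         else if ( c == in_esc )
|         {
|             /* swallow the escape, remember it for the next character */
|             esc_next = 1;
|         }
|         else if ( c == in_quote )
|         {
|             /* toggle quoting state */
|             quoting = !quoting;
|         }
|         else if ( c == in_sep && !quoting )
|         {
|             /* new field */
|             putchar( out_sep );
|         }
|         else
|         {
|             /* plain old character */
|             putchar( c );
|         }
|     }
| 
|     return 0;
| }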
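
Some CSV producers escape a quote inside a quoted field by doubling it
("") rather than with a backslash.  I don't know whether your data does
that, so treat the following as a sketch under that assumption only.
The name csv_line_doubled_quotes is made up for illustration, and it
works on one line in memory because the rule needs a one-character
lookahead.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical variant: convert one CSV line to tab-delimited text,
 * treating "" inside a quoted field as a literal quote character.
 * The output is never longer than the input, so out_size must be at
 * least strlen(in) + 1.  Returns the number of characters written,
 * not counting the terminating NUL. */
size_t csv_line_doubled_quotes( const char * in, char * out,
                                size_t out_size )
{
    size_t i;
    size_t n = 0;
    int quoting = 0;

    assert( in != NULL && out != NULL && out_size > strlen( in ) );

    for ( i = 0; in[i] != '\0'; i++ )
    {
        char c = in[i];

        if ( c == '"' )
        {
            if ( quoting && in[i + 1] == '"' )
            {
                /* doubled quote inside a quoted field: emit one
                 * literal quote and skip the second one */
                out[n++] = '"';
                i++;
            }
            else
            {
                /* toggle quoting state */
                quoting = !quoting;
            }
        }
        else if ( c == ',' && !quoting )
        {
            /* new field */
            out[n++] = '\t';
        }
        else
        {
            /* plain old character */
            out[n++] = c;
        }
    }

    out[n] = '\0';
    return n;
}
```

An empty quoted field ("") still comes out empty, since the first quote
only opens the field and the lookahead applies to the second one.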

Good luck,
t.