Can awk deal with CSV file that contains comma inside a quoted field?

https://stackoverflow.com/questions/3138363

01-10-2019
|

Question

I am using awk to perform counting the sum of one column in the csv file. The data format is something like:

id, name, value
1, foo, 17
2, bar, 76
3, "I am the, question", 99

I was using this awk script to count the sum:

awk -F, '{sum+=$3} END {print sum}'

Some of the value in name field contains comma and this break my awk script. My question is: can awk solve this problem? If yes, and how can I do that?

Thank you.

Solution

you write a function in awk like below:

$ awk 'func isnum(x){return(x==x+0)}BEGIN{print isnum("hello"),isnum("-42")}'
0 1

you can incorporate in your script this function and check whether the third field is numeric or not.if not numeric then go for the 4th field and if the 4th field inturn is not numberic go for 5th ...till you reach a numeric value.probably a loop will help here, and add it to the sum.

OTHER TIPS

One way using GNU awk and FPAT

awk 'BEGIN { FPAT = "([^, ]+)|(\"[^\"]+\")" } { sum+=$3 } END { print sum }' file.txt

Result:

You're probably better off doing it in perl with Text::CSV, since that's a fast and robust solution.

You can help awk work with data fields that contain commas (or newlines) by using a small script I wrote called csvquote. It replaces the offending commas inside quoted fields with nonprinting characters. If you need to, you can later restore those commas - but in this case, you don't need to.

Here is the command:

csvquote inputfile.csv | awk -F, '{sum+=$3} END {print sum}'

see https://github.com/dbro/csvquote for the code

I am using

`FPAT="([^,]+)|(\"[^\"]+\")" `

to define the fields with gawk. I found that when the field is null this doesn't recognize correct number of fields. Because "+" requires at least 1 character in the field. I changed it to:

`FPAT="([^,]*)|(\"[^\"]*\")"`

and replace "+" with "*". It works correctly.

I also find that GNU Awk User Guide also has this problem. https://www.gnu.org/software/gawk/manual/html_node/Splitting-By-Content.html

For as simple an input file as that you can just write a small function to convert all of the real FSs outside of the quotes to some other value (I chose RS since the record separator cannot be part of the record) and then use that as the FS, e.g.:

$ cat decsv.awk
BEGIN{ fs=FS; FS=RS }

{
   decsv()

   for (i=1;i<=NF;i++) {
       printf "Record %d, Field %d is <%s>\n" ,NR,i,$i
   }
   print ""
}

function decsv(         curr,head,tail)
{
   tail = $0
   while ( match(tail,/"[^"]+"/) ) {
       head = substr(tail, 1, RSTART-1);
       gsub(fs,RS,head)
       curr = curr head substr(tail, RSTART, RLENGTH)
       tail = substr(tail, RSTART + RLENGTH)
   }
   gsub(fs,RS,tail)
   $0 = curr tail
}

$ cat file
id, name, value
1, foo, 17
2, bar, 76
3, "I am the, question", 99

$ awk -F", " -f decsv.awk file
Record 1, Field 1 is <id>
Record 1, Field 2 is <name>
Record 1, Field 3 is <value>

Record 2, Field 1 is <1>
Record 2, Field 2 is <foo>
Record 2, Field 3 is <17>

Record 3, Field 1 is <2>
Record 3, Field 2 is <bar>
Record 3, Field 3 is <76>

Record 4, Field 1 is <3>
Record 4, Field 2 is <"I am the, question">
Record 4, Field 3 is <99>

It only becomes complicated when you have to deal with embedded newlines and embedded escaped quotes within the quotes and even then it's not too hard and it's all been done before...

See What's the most robust way to efficiently parse CSV using awk? for more information.

You can always tackle the problem from the source. Put quotes around the name field, just like the field of "I am the, question". This is much easier than spending your time coding workarounds for that.

Update(as Dennis requested). A simple example

$ s='id, "name1,name2", value 1, foo, 17 2, bar, 76 3, "I am the, question", 99'

$ echo $s|awk -F'"' '{ for(i=1;i<=NF;i+=2) print $i}'
id,
, value 1, foo, 17 2, bar, 76 3,
, 99

$ echo $s|awk -F'"' '{ for(i=2;i<=NF;i+=2) print $i}'
name1,name2
I am the, question

As you can see, by setting the delimiter to double quote, the fields that belong to the "quotes" are always on even number. Since OP doesn't have the luxury of modifying the source data, this method will not be appropriate to him.

If you know for sure that the 'value' column is always the last column:

awk -F, '{sum+=$NF} END {print sum}'

NF represents the number of fields, so $NF is the last column

This article did help me solve this same data field issue. Most CSV will put a quote around fields with spaces or commas within them. This messes up the field count for awk unless you filter them out.

If you need the data within those fields that contain the garbage, this is not for you. ghostdog74 provided the answer, which empties that field but maintains the total field count in the end, which is key to keeping the data output consistent. I did not like how this solution introduced new lines. This is the version of this solution I used. The fist three fields never had this problem in the data. The fourth field containing customer name often did, but I needed that data. The remaining fields that exhibit the problem I could throw away without issue because it was not needed in my report output. So I first sed out the 4th field's garbage very specifically and remove the first two instances of quotes. Then I apply what ghostdog74gave to empty the remaining fields that have commas within them - this also removes the quotes, but I use printfto maintain the data in a single record. I start off with 85 fields and end up with 85 fields in all cases from my 8000+ lines of messy data. A perfect score!

grep -i $1 $dbfile | sed 's/\, Inc.//;s/, LLC.//;s/, LLC//;s/, Ltd.//;s/\"//;s/\"//' | awk -F'"' '{ for(i=1;i<=NF;i+=2) printf ($i);printf ("\n")}' > $tmpfile

The solution that empties the fields with commas within them but also maintains the record, of course is:

awk -F'"' '{ for(i=1;i<=NF;i+=2) printf ($i);printf ("\n")}

Megs of thanks to ghostdog74 for the great solution!

NetsGuy256/

FPAT is the elegant solution because it can handle the dreaded commas within quotes problem, but to sum a column of numbers in the last column regardless of the number of preceding separators, $NF works well:

awk -F"," '{sum+=$NF} END {print sum}'

To access the second to last column, you would use this:

awk -F"," '{sum+=$(NF-1)} END {print sum}'

Fully fledged CSV parsers such as Perl's Text::CSV_XS are purpose-built to handle that kind of weirdness.

perl -MText::CSV_XS -lne 'BEGIN{$csv=Text::CSV_XS->new({allow_whitespace => 1})} if($csv->parse($_)){@f=$csv->fields();$sum+=$f[2]} END{print $sum}' file

allow_whitespace is needed since the input data has whitespace surrounding the comma separators. Very old versions of Text::CSV_XS may not support this option.

I provided more explanation of Text::CSV_XS within my answer here: parse csv file using gawk

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow