Question

From an external source, I get huge CSV files (around 16GB) whose fields are optionally enclosed in double quotes ("). Fields are separated by semicolons (;). When a field's content contains a double quote, it is escaped as two double quotes.

Currently, I am importing these into a MySQL database, which understands the semantics of "".

I am considering a migration to Amazon Redshift, but it (or probably PostgreSQL in general) requires quotes to be escaped with a backslash, as \".

Now I am looking for the fastest command-line tool (probably awk or sed?) and the exact syntax to convert my files.

Example input:

"""start of line";"""beginning "" middle and end """;"end of line"""
12345;"Tell me an ""intelligent"" joke; I tell you one in return"
54321;"Your mom is ""nice"""
"";"";""
"However, if;""Quotes""; are present"

Example output:

"\"start of line";"\"beginning \" middle and end \"";"end of line\""
12345;"Tell me an \"intelligent\" joke; I tell you one in return"
54321;"Your mom is \"nice\""
"";"";""
"However, if;\"Quotes\"; are present"

Edit: Added more tests.


Solution

There are a few edge cases to be aware of:

  • What if doubled double-quotes are at the beginning of a string?
  • What if that string is the first field?
  • What if a field contains an empty string?
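To see why these cases matter, here is a minimal sketch of what a plain global substitution does to them. The function name `naive` is only illustrative; it wraps the one-line sed command with no awareness of field boundaries.

```shell
# Naive approach: replace every doubled quote, wherever it occurs.
naive() { sed 's/""/\\"/g'; }

# Empty fields: each "" is itself a doubled quote, so it gets rewritten.
printf '%s\n' '"";"";""' | naive
# prints: \";\";\"

# Quote at the start of a field: the opening delimiter is consumed.
printf '%s\n' '"""start of line";"x"' | naive
# prints: \""start of line";"x"
```

In both cases the field delimiters are damaged, which is what the extra steps in the script below guard against.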
sed -r '
    # at the start of a line or the start of a field, 
    # replace """ with "\"
    s/(^|;)"""/\1"\\"/g

    # replace any doubled double-quote with an escaped double-quote.
    # this affects any "inner" quote pair as well as end of field or end of line
    # if there is an escaped quote from the previous command, don't be fooled by
    # the quote that follows it.
    s/([^\\])""/\1\\"/g

    # the above step will destroy empty strings. fix them here. this uses a
    # conditional loop: if there are 2 consecutive empty fields, they will
    # share a delimiter, so we have to process the line more than once
    :fix_empty_fields
    s/(^|;)\\"($|;)/\1""\2/g
    t fix_empty_fields
' <<'END'

"""start of line";"""beginning "" middle and end """;"end of line"""
"";"";"";"""";"""""";"";""

END
"\"start of line";"\"beginning \" middle and end \"";"end of line\""
"";"";"";"\"";"\"\"";"";""

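As a quick sanity check, the same three substitutions can be run against the sample input from the question and compared with the expected output given there. This is only a sketch: the function name `convert_quotes` is made up for illustration, and `-E` is used as the portable spelling of `-r`.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the sed script from this answer.
convert_quotes() {
    sed -E '
        s/(^|;)"""/\1"\\"/g
        s/([^\\])""/\1\\"/g
        :fix_empty_fields
        s/(^|;)\\"($|;)/\1""\2/g
        t fix_empty_fields
    '
}

# Sample input, copied from the question.
input=$(cat <<'END'
"""start of line";"""beginning "" middle and end """;"end of line"""
12345;"Tell me an ""intelligent"" joke; I tell you one in return"
54321;"Your mom is ""nice"""
"";"";""
"However, if;""Quotes""; are present"
END
)

# Expected output, copied from the question.
expected=$(cat <<'END'
"\"start of line";"\"beginning \" middle and end \"";"end of line\""
12345;"Tell me an \"intelligent\" joke; I tell you one in return"
54321;"Your mom is \"nice\""
"";"";""
"However, if;\"Quotes\"; are present"
END
)

actual=$(printf '%s\n' "$input" | convert_quotes)

if [ "$actual" = "$expected" ]; then
    echo "all sample lines match"
else
    echo "mismatch, got:"
    printf '%s\n' "$actual"
fi
```

Running a check like this on a small sample before starting a multi-gigabyte conversion is cheap insurance against quoting mistakes in the script.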
Sed is an efficient tool, but it will still take a while with 16GB files. You will also need at least 16GB of free disk space to write the updated files (even sed's -i in-place edit uses temp files behind the scenes).
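Since a second copy of the data is written either way, one option is to skip -i and stream each file into a new one. The sketch below assumes a hypothetical helper `convert_file` and made-up file names. Setting `LC_ALL=C` is an optional tweak that makes sed work on raw bytes, which is often (not always) faster in UTF-8 locales; it is safe here only because every pattern is plain ASCII.

```shell
#!/usr/bin/env bash
# Hypothetical helper: convert_file INPUT OUTPUT
# Streams INPUT through the conversion and writes the result to OUTPUT.
convert_file() {
    LC_ALL=C sed -E '
        s/(^|;)"""/\1"\\"/g
        s/([^\\])""/\1\\"/g
        :fix_empty_fields
        s/(^|;)\\"($|;)/\1""\2/g
        t fix_empty_fields
    ' < "$1" > "$2"
}

# Tiny demonstration with a throwaway file (names are illustrative only).
tmp=$(mktemp -d)
printf '%s\n' '54321;"Your mom is ""nice"""' '"";"";""' > "$tmp/in.csv"
convert_file "$tmp/in.csv" "$tmp/out.csv"
result=$(cat "$tmp/out.csv")
printf '%s\n' "$result"
rm -rf "$tmp"
```

It would be worth timing the command with and without `LC_ALL=C` on a slice of the real data before relying on any speedup.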

refs: GNU sed manual, sed looping commands

OTHER TIPS

I would use sed, as you suggest in your post:

$ sed 's@""@\\"@g' input
12345;"Tell me an \"intelligent\" joke; I tell you one in return"
54321;"Your mom is \"nice\""

I would go for using sed:

$ sed 's:"":\\":g' your_csv.csv

When testing it on the following:

new """
test ""
"hows "" this "" "

I got:

new \""
test \"
"hows \" this \" "

This line should work:

sed 's/""/\\"/g' file

With sed:

sed 's/""/\\"/g' input_file

Test:

$ cat n.txt 
12345;"Tell me an ""intelligent"" joke; I tell you one in return"
54321;"Your mom is ""nice"""

$ sed 's/""/\\"/g' n.txt 
12345;"Tell me an \"intelligent\" joke; I tell you one in return"
54321;"Your mom is \"nice\""
Licensed under: CC-BY-SA with attribution