Question

I am looking for best practices for handling CSV and tab-delimited files.

For CSV files I already apply special formatting when a value contains a comma or double quote, but what if the value contains a newline character? Should I leave the newline intact, enclose the value in double quotes, and escape any double quotes within the value?

Same question for tab-delimited files. I assume the answer would be very similar, if not the same.

Solution

Usually you keep the \n unaltered and rely on the fact that the newline character is enclosed in a double-quoted string. This creates no ambiguity, but it is ugly if you have to look at the file in a plain text editor.

That is still how you should do it, since nothing inside a quoted CSV string is escaped except the double quote itself, which is written as two double quotes.
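
As a rough illustration of the above, here is a minimal sketch using Python's standard csv module (my choice of tool, the question doesn't name a language). The sample rows are made up. With the default settings the writer quotes only the fields that need it, doubles any embedded double quotes, and leaves the newline as-is inside the quotes.

```python
import csv
import io

# Made-up sample data: the second field contains a quote, a comma and a newline.
rows = [["id", "comment"], ["1", 'He said "hi",\nthen left']]

# newline="" stops Python from translating line endings behind the csv module's back.
buf = io.StringIO(newline="")
csv.writer(buf).writerows(rows)  # defaults: QUOTE_MINIMAL, quotes doubled
text = buf.getvalue()

# Reading it back: the quoted field is returned with its newline intact.
parsed = list(csv.reader(io.StringIO(text, newline="")))

print(repr(text))
print(parsed == rows)
```

When writing to a real file, the same newline="" argument should be passed to open(), otherwise the embedded newlines may be altered on some platforms.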

OTHER TIPS

@Jack is right that your best bet is to keep the \n unaltered, since a parser will expect it to appear inside double quotes.

As with most things, I think consistency is key here. As far as I know, values only need to be double-quoted if they span multiple lines, contain commas, or contain double quotes. Some implementations I've seen escape and double-quote every value, since that simplifies the parsing algorithm (the writer never has to decide whether a value needs quoting, and the reader can always expect a quoted value).

This isn't the most space-efficient solution, but it makes reading and writing the file trivial, both for your own library and for others that may consume the file in the future.
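
For what it's worth, the "quote everything" approach could look like this in Python, assuming the standard csv module is acceptable. The helper name write_all_quoted is just something I made up for the example.

```python
import csv
import io

def write_all_quoted(rows):
    """Serialize rows to CSV text with every field double-quoted."""
    buf = io.StringIO(newline="")
    # QUOTE_ALL quotes every field, whether or not it strictly needs it.
    csv.writer(buf, quoting=csv.QUOTE_ALL).writerows(rows)
    return buf.getvalue()

text = write_all_quoted([["a", "b,c"], ["1", 'x"y']])
print(repr(text))

# Any conforming reader handles the fully quoted form the same way.
parsed = list(csv.reader(io.StringIO(text, newline="")))
```

The output is slightly larger, but every field follows the same rule, which is the consistency argument made above.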

For TSV, if you want lossless representation of values, the "Linear TSV" specification is worth considering: http://paulfitz.github.io/dataprotocols/linear-tsv/index.html

Because a raw tab or newline inside a value would be read as a field or record separator, most such conventions use at least the following escapes:

   \n for newline,
   \t for tab,
   \r for carriage return,
   \\ for backslash

Some tools add \0 for NUL.
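
A minimal sketch of those four escapes in Python, in case it helps. This is not a full implementation of the Linear TSV specification, only the escapes listed above, and the function names (escape_field, unescape_field, encode_row, decode_row) are my own. Unknown backslash sequences are passed through unchanged, which is an assumption on my part rather than something the list above prescribes.

```python
# Escape tables for the four sequences listed above.
_ESCAPES = {"\\": "\\\\", "\t": "\\t", "\n": "\\n", "\r": "\\r"}
_UNESCAPES = {"\\": "\\", "t": "\t", "n": "\n", "r": "\r"}

def escape_field(value):
    """Replace backslash, tab, newline and carriage return with escapes."""
    return "".join(_ESCAPES.get(ch, ch) for ch in value)

def unescape_field(text):
    """Reverse escape_field, scanning left to right so that an escaped
    backslash followed by 'n' is not mistaken for a newline escape."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text) and text[i + 1] in _UNESCAPES:
            out.append(_UNESCAPES[text[i + 1]])
            i += 2
        else:
            out.append(ch)  # ordinary character or unknown escape: keep as-is
            i += 1
    return "".join(out)

def encode_row(fields):
    """Join escaped fields with tabs; the result contains no raw newline."""
    return "\t".join(escape_field(f) for f in fields)

def decode_row(line):
    """Split on raw tabs, then unescape each field."""
    return [unescape_field(f) for f in line.split("\t")]

fields = ["a\tb", "line1\nline2", "C:\\new", "cr\rhere"]
line = encode_row(fields)
print(line)
print(decode_row(line) == fields)
```

Because every tab and newline inside a value is escaped, each record stays on one physical line and can be split on raw tabs without any quoting logic.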

Licensed under: CC-BY-SA with attribution