Java CSV parser with unescaped quotes [closed]

Question 1

The right solution is to find the person who generated the data and beat them over the head with a keyboard until they fix the problem on their end.

Once you've exhausted that route, you could try some of the other CSV parsers on the market. I've used OpenCSV with success in the past.

Even if OpenCSV won't solve the problem out of the box, the code is fairly easy to read and available under an Apache license, so it might be possible to modify the algorithm to work with your wonky data, and probably easier than starting from scratch.

Question 2

Surprising even myself here, but I think I would hack it myself. You only need to read the lines and generate the tokens by splitting on quotes or commas, whichever you prefer. That way you can adjust the logic to suit your data. It's not very hard. The file seems broken enough that working around it with an existing solution looks like more work.

One point, though: if LibreOffice already parses it correctly, couldn't you just save the file from there, generating a more reasonable file? However, if you think LibreOffice might be guessing, just write the tokenizer yourself.
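For what it's worth, a hand-rolled tokenizer along those lines might look roughly like the sketch below. The class name LenientCsvSplitter and the rules it applies are my own assumptions, not anything from the question: a quote only opens a field when it is the first character of that field, and only closes it when it is followed by the delimiter or the end of the line. Any other quote inside a quoted field is kept as literal text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits a single CSV line while tolerating unescaped quotes.
final class LenientCsvSplitter {

  static List<String> split(String line, char delimiter, char quote) {
    List<String> fields = new ArrayList<>();
    StringBuilder field = new StringBuilder();
    boolean inQuotes = false;
    boolean fieldStart = true;

    for (int i = 0; i < line.length(); i++) {
      char c = line.charAt(i);
      boolean atEnd = i + 1 == line.length();
      char next = atEnd ? '\0' : line.charAt(i + 1);

      if (inQuotes) {
        if (c == quote && (atEnd || next == delimiter)) {
          inQuotes = false;        // quote followed by delimiter/end: closing quote
        } else if (c == quote && next == quote) {
          field.append(quote);     // properly escaped pair, keep one quote
          i++;
        } else {
          field.append(c);         // ordinary text, including stray quotes
        }
      } else if (c == delimiter) {
        fields.add(field.toString());
        field.setLength(0);
        fieldStart = true;         // next character starts a new field
        continue;
      } else if (c == quote && fieldStart) {
        inQuotes = true;           // opening quote, not part of the value
      } else {
        field.append(c);
      }
      fieldStart = false;
    }
    fields.add(field.toString());  // last field has no trailing delimiter
    return fields;
  }
}
```

Calling LenientCsvSplitter.split(line, ',', '"') on each line you read gives you the tokens. It is still guessing: a stray quote that happens to sit directly before a comma inside a field will be taken as the closing quote, and fields that span multiple lines are not handled.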

Question 3

+1 for the 'choking on fruit worms' pun - I nearly choked on my coffee reading that :)

If you really can't get that CSV fixed, then you could just supply your own Tokenizer (Super CSV is very flexible like that!).

You'd normally write your own readColumns() implementation, but it's quicker to extend the default Tokenizer and override the readLine() method to intercept the String (and fix the unescaped quotes) before it's tokenized.

I've made an assumption here that any quotes not next to a delimiter or at the start/end of the line should be escaped. It's far from perfect, but it works for your sample input. You can implement this however you like - it was too early in the morning for me to use a regex :)

This way you don't have to modify Super CSV at all (it just plugs in), so you get all of the other features like cell processors and bean mapping as well.

package org.supercsv;
import java.io.IOException;
import java.io.Reader;
import org.supercsv.io.Tokenizer;
import org.supercsv.prefs.CsvPreference;

public class FruitWormTokenizer extends Tokenizer {

  public FruitWormTokenizer(Reader reader, CsvPreference preferences) {
    super(reader, preferences);
  }

  @Override
  protected String readLine() throws IOException {
    final String line = super.readLine();
    if (line == null) {
      return null;
    }

    final char quote = (char) getPreferences().getQuoteChar();
    final char delimiter = (char) getPreferences().getDelimiterChar();

    // escape all quotes not next to a delimiter (or start/end of line)
    final StringBuilder b = new StringBuilder(line);
    for (int i = b.length() - 1; i >= 0; i--) {
      if (quote == b.charAt(i)) {
        final boolean validCharBefore = i - 1 < 0
            || b.charAt(i - 1) == delimiter;
        final boolean validCharAfter = i + 1 == b.length()
            || b.charAt(i + 1) == delimiter;
        if (!(validCharBefore || validCharAfter)) {
          // escape that quote!
          b.insert(i, quote);
        }
      }
    }
    return b.toString();
  }
}

You can just supply this Tokenizer to the constructor of your CSV reader (for example CsvListReader, CsvMapReader or CsvBeanReader), alongside the same preferences.
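If you'd rather use a regex after all, the same heuristic (double any quote that is neither at the start of the line or right after a delimiter, nor at the end of the line or right before a delimiter) can probably be written with lookarounds. This is only a sketch that runs without Super CSV on the classpath; the class name QuoteEscaper and the method escapeStrayQuotes are made up for illustration, and it carries the same assumption about which quotes are "stray".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: regex version of the "escape stray quotes" heuristic.
final class QuoteEscaper {

  static String escapeStrayQuotes(String line, char quote, char delimiter) {
    String q = Pattern.quote(String.valueOf(quote));
    String d = Pattern.quote(String.valueOf(delimiter));

    // Match a quote that is not at the start of the line, not preceded by the
    // delimiter, not at the end of the line and not followed by the delimiter.
    Pattern stray = Pattern.compile("(?!^)(?<!" + d + ")" + q + "(?!\\z|" + d + ")");

    // Escape by doubling the quote; quoteReplacement guards against quote
    // characters that are special in replacement strings (such as $ or \).
    String doubled = Matcher.quoteReplacement("" + quote + quote);
    return stray.matcher(line).replaceAll(doubled);
  }
}
```

Inside readLine() the loop could then be replaced by a call such as escapeStrayQuotes(line, quote, delimiter). Compiling the pattern once and caching it would be the obvious next step if the file is large.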