I am working on an application that processes large CSV files (several hundred MB). Recently I hit a problem that at first looked like a memory leak in the application, but after some investigation it turned out to be a combination of a badly formatted CSV file and CsvListReader trying to parse a never-ending line. As a result, I got the following exception:

at java.lang.OutOfMemoryError.<init>(<unknown string>)
at java.util.Arrays.copyOf(<unknown string>)
   Local Variable: char[]#13624
at java.lang.AbstractStringBuilder.expandCapacity(<unknown string>)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(<unknown string>)
at java.lang.AbstractStringBuilder.append(<unknown string>)
at java.lang.StringBuilder.append(<unknown string>)
   Local Variable: java.lang.StringBuilder#3
at org.supercsv.io.Tokenizer.readStringList(<unknown string>)
   Local Variable: java.util.ArrayList#642
   Local Variable: org.supercsv.io.Tokenizer#1
   Local Variable: org.supercsv.io.PARSERSTATE#2
   Local Variable: java.lang.String#14960
at org.supercsv.io.CsvListReader.read(<unknown string>)

By analyzing the heap dump, and then the CSV file based on the dump findings, I noticed that one of the columns in one of the CSV lines was missing its closing quote. As a result, the reader kept looking for the end of the quoted field, appending file content to its internal string buffer until the heap was exhausted.

So the problem was caused by the badly formatted CSV: once I removed the offending line, the problem disappeared. What I want to achieve is to tell the reader that:

  • All content it interprets always ends at the newline character, even if quotes are not closed properly (no multi-line support)
  • Alternatively, there is a certain limit (in bytes) on the length of a CSV line

Is there a clean way to do this in Super CSV using CsvListReader (preferred in my case)?

Solution

That issue has been reported, and I'm currently working on some enhancements (for a future major release) that should make both options a bit easier.

For now, you'll have to supply your own Tokenizer to the reader (so Super CSV uses yours instead of its own). I'd suggest taking a copy of Super CSV's Tokenizer and modifying it with your changes. That way you don't have to modify Super CSV itself, and you won't waste time.
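
To illustrate the two behaviours asked for, here is a minimal, stand-alone sketch of such a tokenizer. It is not Super CSV's Tokenizer and does not implement its interface: the class name `SingleLineCsvTokenizer`, the method `readRow` and the `maxLineLength` parameter are all hypothetical, and the limit is counted in characters rather than bytes. The same two changes (drop the quote state at the end of every line, and refuse to buffer beyond a limit) would have to be carried over into your copy of the real Tokenizer, whose exact interface depends on the Super CSV version you use.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical single-line CSV tokenizer (illustration only, not part of Super CSV). */
class SingleLineCsvTokenizer {
    private final BufferedReader reader;
    private final int maxLineLength; // assumed limit, in characters (not bytes)
    private final char delimiter;
    private final char quote;

    SingleLineCsvTokenizer(Reader in, int maxLineLength, char delimiter, char quote) {
        this.reader = new BufferedReader(in);
        this.maxLineLength = maxLineLength;
        this.delimiter = delimiter;
        this.quote = quote;
    }

    /** Fills 'columns' from one physical line; returns false at end of input. */
    boolean readRow(List<String> columns) throws IOException {
        columns.clear();
        String line = readBoundedLine();
        if (line == null) {
            return false;
        }
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c != quote) {
                    field.append(c);
                } else if (i + 1 < line.length() && line.charAt(i + 1) == quote) {
                    field.append(quote); // doubled quote is an escaped quote
                    i++;
                } else {
                    inQuotes = false;
                }
            } else if (c == quote) {
                inQuotes = true;
            } else if (c == delimiter) {
                columns.add(field.toString());
                field.setLength(0);
            } else {
                field.append(c);
            }
        }
        // No multi-line support: an unclosed quote simply ends with the line.
        columns.add(field.toString());
        return true;
    }

    /** Reads up to the next '\n', dropping '\r', and fails once the limit is exceeded. */
    private String readBoundedLine() throws IOException {
        int c = reader.read();
        if (c == -1) {
            return null;
        }
        StringBuilder sb = new StringBuilder();
        while (c != -1 && c != '\n') {
            if (c != '\r') {
                if (sb.length() >= maxLineLength) {
                    throw new IOException("Line exceeds " + maxLineLength + " characters");
                }
                sb.append((char) c);
            }
            c = reader.read();
        }
        return sb.toString();
    }
}
```

With this sketch, a row whose quote is never closed is cut off at the end of its own line, so the following rows still parse normally, and an over-long line fails fast with an IOException instead of growing the buffer until the heap runs out.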

Licensed under: CC-BY-SA with attribution