Question

For years I've been parsing CSV files in my nightly batch jobs using the following logic without issue. I'm now doing a full rewrite of the application and wondering whether there would be any performance or quality gains from using something like opencsv. I have no experience with other libraries, so I was hoping someone who has could chime in.

    while ((line = br.readLine()) != null) {
        String[] items = line.split(",");

        for (int i = 0; i < items.length; ++i) {
            // Remove extra quotes
            if (items[i].length() > 2) {
                items[i] = items[i].replaceAll("\"", "");
            }

            // Replace blank items with nulls
            if (items[i].matches("^\\s*$")) {
                items[i] = null;
            }
        }

        String item0 = items[0];
        String item1 = items[1];
    }

Solution

You won't gain any performance, but a library will help you deal with fields that have embedded commas. Microsoft's obnoxious solution of using double-quotes instead of escaping the commas is a pain to deal with by hand, and opencsv will handle all of that for you.
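
To make the embedded-comma problem concrete, here is a minimal sketch of what quote-aware splitting involves. The names `QuotedFieldDemo` and `splitRecord` are hypothetical, and this hand-rolled splitter only illustrates the quoting rule (a doubled quote inside a quoted field stands for one literal quote); it is not how opencsv is implemented.

```java
import java.util.ArrayList;
import java.util.List;

final class QuotedFieldDemo {

    /** Splits one CSV record, honouring double-quoted fields and "" as an escaped quote. */
    static List<String> splitRecord(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"'); // doubled quote -> one literal quote
                        i++;                 // skip the second quote of the pair
                    } else {
                        inQuotes = false;    // closing quote
                    }
                } else {
                    current.append(c);       // commas inside quotes are plain data
                }
            } else if (c == '"') {
                inQuotes = true;             // opening quote
            } else if (c == ',') {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());      // last field has no trailing comma
        return fields;
    }

    public static void main(String[] args) {
        String line = "1,\"Smith, John\",\"He said \"\"hi\"\"\"";
        System.out.println(line.split(",").length); // naive split cuts the name in two
        System.out.println(splitRecord(line));
    }
}
```

Even this sketch still works on one line at a time, so a quoted field that contains a line break would be split incorrectly when fed from `readLine()`. That is one more case a library such as opencsv is designed to cover.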

OTHER TIPS

The answer given by chrylis is right: you may not gain performance, but opencsv will handle all the edge cases for you.
If you are really worried about performance, though, a small tweak to your code can help. Consider the code for String.split:

    public String[] split(String regex) {
        return split(regex, 0);
    }
    public String[] split(String regex, int limit) {
        return Pattern.compile(regex).split(this, limit);
    }

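Every call to split therefore compiles a new Pattern. (The listing is simplified: JDK 7 and later add a fast path for single-character delimiters such as "," that skips the regex engine, so the cost falls mostly on the other regex calls.) The code for Pattern.compile is

    public static Pattern compile(String regex, int flags) {
        return new Pattern(regex, flags);
    }

The same Pattern creation is repeated at

    items[i].matches("^\\s*$")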

If your files contain millions of lines, creating millions of Pattern objects adds overhead, so you can change your code as follows:

    // requires: import java.util.regex.Pattern;
    Pattern pat = Pattern.compile(",");
    Pattern regexPattern = Pattern.compile("^\\s*$");
    while ((line = br.readLine()) != null)
    {
        String[] items = pat.split(line, 0);
        for (int i = 0; i < items.length; ++i)
        {
            // Use items[i].length() (string length), not items.length (array length).
            // No null check is needed: split never returns null elements.
            if (items[i].length() > 2)
            {
                items[i] = items[i].replaceAll("\"", "");
            }
            if (regexPattern.matcher(items[i]).matches()) {
                items[i] = null;
            }
        }
    }

The gain will not be noticeable for small files, but for large files, or when the same code runs over millions of files, the improvement can be significant.
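
The same reasoning extends to `replaceAll`, which also compiles its regex on every call and is still invoked once per field in the loop above. As a rough sketch, all three patterns can be hoisted into constants; the class name `CsvLineCleaner` and the helper `cleanLine` are hypothetical, chosen only for this illustration.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

final class CsvLineCleaner {
    // Compiled once and reused for every line and every field.
    private static final Pattern COMMA = Pattern.compile(",");
    private static final Pattern QUOTE = Pattern.compile("\"");
    private static final Pattern BLANK = Pattern.compile("^\\s*$");

    /** Splits a line on commas, strips quotes, and turns blank fields into null. */
    static String[] cleanLine(String line) {
        String[] items = COMMA.split(line, 0);
        for (int i = 0; i < items.length; i++) {
            if (items[i].length() > 2) {
                // Same effect as items[i].replaceAll("\"", ""), without recompiling
                items[i] = QUOTE.matcher(items[i]).replaceAll("");
            }
            if (BLANK.matcher(items[i]).matches()) {
                items[i] = null;
            }
        }
        return items;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(cleanLine("\"abc\", ,x")));
    }
}
```

One behaviour carried over from the original code is worth knowing: with a limit of 0, split drops trailing empty strings, so a line such as `a,b,,` yields two items rather than four. Whether this version is measurably faster depends on the JDK and the data, so it is worth timing on a real file before committing to it.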

To add to your options, consider the Jackson CsvMapper.

I parse 36 million rows out of around 4k files in 12 minutes using the Jackson CsvMapper on a MacBook Pro. That includes mapping directly to POJOs in some places, reading an Object[] per line in others, and applying a huge amount of ancillary processing to normalise the inputs.

It's also really easy to use:

as Object[]

    CsvMapper mapper = new CsvMapper();
    mapper.enable(CsvParser.Feature.WRAP_AS_ARRAY);
    File csvFile = new File("input.csv"); // or from String, URL etc
    MappingIterator<Object[]> it = mapper.reader(Object[].class).readValues(csvFile);

as POJOs

    public class CSVPerson{
      public String firstname;
      public String lastname;
      //etc
    }

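    CsvMapper mapper = new CsvMapper();
    // delimiter (a char) and input (File, String, URL etc) are supplied by the caller
    CsvSchema schema = CsvSchema.emptySchema().withHeader().withColumnSeparator(delimiter);
    // On Jackson 2.5+, readerFor(CSVPerson.class) replaces the deprecated reader(Class)
    MappingIterator<CSVPerson> it = mapper.reader(CSVPerson.class).with(schema).readValues(input);
    while (it.hasNext()) {
      CSVPerson row = it.next();
      // process row here
    }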

I'm always singing the praises of this library. It's great, and it's also really flexible.

Licensed under: CC-BY-SA with attribution