Question

I have a summary tool written in Python that reads input files and writes them into a summary file. I have the following stipulations:

  1. No duplicates.
  2. If it exists, add a count to it.

Is it better / faster to write one GIANT output file and THEN de-duplicate the entries, or to dedupe as you go (i.e., each write checks the existing entries before writing)?

The small files are about 100-300 KB each, there can be hundreds of thousands of them, and the final output file is usually 1-4 MB.

A sample line in the file would be like this:

String,number

I would be checking the string for dupes. If it were a dupe, I would output:

string,COUNT,number(additive)

as in, I would keep adding the numbers every time I got a duplicate, and keep a count of how many times it was duplicated.
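
For example (illustrative values, and assuming COUNT is the total number of occurrences): if the line foo,3 appears in one input file and foo,5 appears in another, the summary would contain a single line foo,2,8.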

I would say there are mostly duplicates.

Solution

Old question, but I'll take a stab at it. Working in memory is almost always going to be faster than repeatedly reading and rewriting a file on disk. It would be better to use a data structure that doesn't allow duplicate entries, like a Set in Java, or, since you also need a count and a running total per string, a map/dictionary keyed by the string, and keep track of the entries as you go. Then, once you have the complete data structure in memory, write it to disk in one pass.
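
A minimal sketch of that approach in Python, under some assumptions not stated in the question: the inputs are comma-separated files with one string,number pair per line, the numbers are integers, COUNT means total occurrences, and the paths inputs/*.csv and summary.csv are placeholders for your real locations.

import csv
import glob
from collections import defaultdict

# key -> [occurrence count, running total]
totals = defaultdict(lambda: [0, 0])

for path in glob.glob("inputs/*.csv"):      # assumed location of the small files
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if len(row) != 2:               # skip blank or malformed lines
                continue
            key, number = row
            totals[key][0] += 1             # how many times we've seen this string
            totals[key][1] += int(number)   # additive total; use float() if needed

# Write the de-duplicated summary once, at the end.
with open("summary.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for key, (count, total) in totals.items():
        # write count - 1 instead if COUNT should only count the duplicates
        writer.writerow([key, count, total])

The dict never grows beyond roughly the size of your 1-4 MB summary, so there is no giant intermediate file to re-read and de-duplicate afterwards.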

Licensed under: CC-BY-SA with attribution