Dedupe while or after write
https://softwareengineering.stackexchange.com/questions/236332
Question
I have a summary tool written in Python that reads input files and writes their entries into a summary file. I have the following stipulations:
- No duplicates.
- If an entry already exists, add to its count instead.
Is it better / faster to write one GIANT output file and THEN de-duplicate the entries, or to dedupe as you go (i.e., each write checks for an existing entry before writing)?
The small files are about 100-300 KB each, there can be hundreds of thousands of them, and the final output file is usually 1-4 MB.
A sample line in the file would be like this:
String,number
I would be checking the string for dupes. If it were a dupe, I would output:
string,COUNT,number(additive)
as in, I would keep adding the numbers every time I got a duplicate, and keep a count of how many times it was duplicated.
I would say there are mostly duplicates.
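To make the merge rule concrete, here is a toy illustration (not the actual tool) of how duplicate strings would collapse, with the sample values being made up:

```python
# Toy illustration of the merge rule described above.
lines = [("apple", 3), ("apple", 4), ("banana", 5)]

merged = {}  # string -> (COUNT, additive number)
for key, number in lines:
    count, total = merged.get(key, (0, 0))
    merged[key] = (count + 1, total + number)

# "apple" was seen twice, so its numbers are added and COUNT is 2.
print(merged["apple"])   # (2, 7)
print(merged["banana"])  # (1, 5)
```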
Answer
Old question, but I'll take a stab at it. Writing to memory is almost always going to be faster than writing to disk. It would be better to use a data structure that disallows duplicate entries (like a Set in Java, or a dict keyed on the string in Python) and keep track of the counts as you go. Then, once you have the whole summary in memory, write it to disk in a single pass.
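A minimal sketch of that approach in Python (the question's language), using a dict rather than a set since a count and a running sum are needed per string; the file paths, glob pattern, and function name are assumptions for illustration:

```python
from collections import defaultdict
import csv
import glob

def summarize(input_glob, output_path):
    """Merge `string,number` lines from many small files into one summary.

    Accumulates everything in memory first, then writes the output
    file once at the end, so no dedup pass over the output is needed.
    """
    totals = defaultdict(lambda: [0, 0])  # string -> [count, sum]
    for path in glob.glob(input_glob):
        with open(path, newline="") as f:
            for key, number in csv.reader(f):
                entry = totals[key]
                entry[0] += 1            # how many times the string appeared
                entry[1] += int(number)  # additive number

    # One sequential write at the end: string,COUNT,number(additive)
    with open(output_path, "w", newline="") as f:
        writer = csv.writer(f)
        for key, (count, total) in totals.items():
            writer.writerow([key, count, total])
```

Since the final output is only 1-4 MB, the whole summary fits comfortably in memory, and the expensive part becomes reading the hundreds of thousands of small files rather than the dedup itself.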