Problem

I have a summary tool written in Python that reads input files and writes them into a summary file. I have the following stipulations:

  1. No duplicates.
  2. If it exists, add a count to it.

Is it better / faster to write one GIANT output file and then de-duplicate the entries, or to de-dupe as you go (i.e., each write checks before writing)?

The small files are about 100-300 KB each, there can be hundreds of thousands of them, and the final output file is usually 1-4 MB.

A sample line in the file would be like this:

String,number

I would be checking the string for dupes. If it were a dupe, I would output:

string,COUNT,number(additive)

as in, I would keep adding the numbers every time I got a duplicate, and keep a count of how many times it was duplicated.
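For example (hypothetical values), if the same string showed up three times across the input files:

    apple,5
    apple,2
    apple,3

the summary would contain a single line:

    apple,3,10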

I would say there are mostly duplicates.


Solution

Old question, but I'll take a stab at it. Building the result in memory is almost always going to be faster than writing everything to disk and de-duplicating afterwards. It would be better to use a data structure that doesn't allow duplicate entries, like a Set in Java; since your tool is in Python and you also need a count and a running total, a dict works the same way. Keep track of the entries as you go, then, once you have the data structure in memory, write it to disk once.
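Here's a minimal sketch of the dedupe-as-you-go approach, assuming each input line really is a comma-separated "string,number" pair; the function name, glob pattern, and file paths are placeholders, not your actual code:

    import csv
    import glob

    def summarize(input_glob, output_path):
        totals = {}  # string -> [occurrence count, running sum]

        for path in glob.glob(input_glob):
            with open(path, newline="") as f:
                for row in csv.reader(f):
                    if len(row) != 2:
                        continue  # skip blank or malformed lines
                    key, value = row
                    entry = totals.setdefault(key, [0, 0])
                    entry[0] += 1           # how many times this string appeared
                    entry[1] += int(value)  # additive total (use float() if not integers)

        # One pass over the in-memory structure, one write to disk.
        with open(output_path, "w", newline="") as f:
            writer = csv.writer(f)
            for key, (count, total) in totals.items():
                writer.writerow([key, count, total])

    summarize("input_dir/*.csv", "summary.csv")

Since your final output is only 1-4 MB, the whole structure fits comfortably in memory even with hundreds of thousands of input files; the dominant cost will be opening and reading the files, not the de-duplication itself.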
