Problem

I have a summary tool written in Python that reads input files and writes them into a summary file. I have the following stipulations:

  1. No duplicates.
  2. If it exists, add a count to it.

Is it better / faster to write one GIANT output file and then de-duplicate the entries, or to de-dupe as you go (i.e., each write checks before writing)?

The small files are about 100-300 KB each, there can be hundreds of thousands of them, and the final output file is usually 1-4 MB.

A sample line in the file would be like this:

String,number

I would be checking the string for dupes. If it were a dupe, I would output:

string,COUNT,number(additive)

as in, I would keep adding the numbers every time I got a duplicate, and keep a count of how many times it was duplicated.
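For example (hypothetical values), if the same string showed up three times across the input files:

    apple,5
    apple,2
    apple,3

the summary would contain a single line:

    apple,3,10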

I would say there are mostly duplicates.


Solution

Old question, but I'll take a stab at it. Building the result in memory is almost always going to be faster than writing everything to disk and de-duplicating afterwards. It would be better to use a data structure that doesn't allow duplicate entries, like a Set in Java; since your tool is in Python and you also need a count and a running total, a dict works the same way. Keep track of the entries as you go, then, once you have the data structure in memory, write it to disk once.
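Here's a minimal sketch of the dedupe-as-you-go approach, assuming each input line really is a comma-separated "string,number" pair; the function name, glob pattern, and file paths are placeholders, not your actual code:

    import csv
    import glob

    def summarize(input_glob, output_path):
        totals = {}  # string -> [occurrence count, running sum]

        for path in glob.glob(input_glob):
            with open(path, newline="") as f:
                for row in csv.reader(f):
                    if len(row) != 2:
                        continue  # skip blank or malformed lines
                    key, value = row
                    entry = totals.setdefault(key, [0, 0])
                    entry[0] += 1           # how many times this string appeared
                    entry[1] += int(value)  # additive total (use float() if not integers)

        # One pass over the in-memory structure, one write to disk.
        with open(output_path, "w", newline="") as f:
            writer = csv.writer(f)
            for key, (count, total) in totals.items():
                writer.writerow([key, count, total])

    summarize("input_dir/*.csv", "summary.csv")

Since your final output is only 1-4 MB, the whole structure fits comfortably in memory even with hundreds of thousands of input files; the dominant cost will be opening and reading the files, not the de-duplication itself.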
