Big data sort and search

https://stackoverflow.com/questions/3926663

30-09-2019
|

문제

I have two files of data, 100 char lines each. File A: 10⁸ lines, file B: 10⁶ lines. And I need to find all the strings from file B that are not in file A.
At first I was thinking feeding both files to mysql, but it looks like it won't ever finish creating an unique key on 10⁸ records.

I'm waiting for your suggestions on this.

해결책

You can perform this operation without a database. The key is to reduce the size of A, since A is much larger than B. Here is how to do this:

Calculate 64-bit hashes using a decent hash function for the strings in the B file. Store these in memory (in a hash table), which you can do because B is small. Then hash all of the strings in your A file, line by line, and see if each one matches a hash for your B file. Any lines with matching hashes (to one from B), should be stored in a file C.

When this process is complete file C will have the small subset of A of potentially matching strings (to B). Now you have a much smaller file C that you need to compare lines of B with. This reduces the problem to a problem where you can actually load all of the lines from C into memory (as a hash table) and compare each line of B to see if it is in C.

다른 팁

You can slightly improve on @michael-goldshteyn's answer (https://stackoverflow.com/a/3926745/179529). Since you need to find all the strings in B that are not in A, you can remove any item from the Hash Table of the elements of B, when you compare and find a match for it with the elements in A. The elements that will remain in the Hash Table are the elements that were not found in file A.

For the sizes you mention you should be able to keep all of B in memory at once, so you could do a simplified version of Goldshteyn's answer; something like this in python:

#!/usr/bin/python3

import sys

if __name__=='__main__':
  b = open(sys.argv[2],'r')
  bs = set()
  for l in b:
    bs.add(l.strip())
  b.close()
  a = open(sys.argv[1],'r')
  for l in a:
    l = l.strip()
    if l in bs:
      bs.remove(l)
  for x in bs:
    print(x)

I've tested this on two files of 10^5 and 10^7 in size with ~8 chars per line on an atom processor. Output from /usr/bin/time:

25.15user 0.27system 0:25.80elapsed 98%CPU (0avgtext+0avgdata 56032maxresident)k
0inputs+0outputs (0major+3862minor)pagefaults 0swaps
  60298   60298  509244

라이센스 : CC-BY-SA ~와 함께 속성

제휴하지 않습니다 StackOverflow