Question

I'm looking for an algorithm that will help me determine substring matches at scale.

I have a pool of 100+ million "needles" (strings). I can do as much pre-processing on them as I want, and storage is cheap.

On the detection side, I have a very large pool (hundreds of TB) of strings to search for needles in, and I also want to be able to stream detection as text comes in. So it's important that the detection be very fast (algorithmically). I can also pre-process this text as part of the detection, obviously.

I can store a copy of all needles, so a probabilistic algorithm would be fine (say, instead of exactly the correct N strings, the algorithm returns those plus some false positives -- I can always do a plain string search afterwards to verify).
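
To make this concrete, here's the rough shape of what I have in mind: fingerprint a fixed-length prefix of every needle during pre-processing, then slide a rolling hash over incoming text and emit candidate offsets on fingerprint hits. The window size `W`, the hash parameters, and the plain Python set standing in for a Bloom filter are all placeholders, not requirements:

```python
# Rough sketch of the prefilter idea, using a Rabin-Karp rolling
# hash over a fixed window.  W, BASE, and MOD are placeholder values,
# and a plain set stands in for a Bloom filter or similar structure.

W = 32                # window length; assumes no needle is shorter
BASE = 257            # polynomial hash base
MOD = (1 << 61) - 1   # large prime modulus

def fingerprint(s: str) -> int:
    """Polynomial hash of a short string (here, a W-char window)."""
    h = 0
    for ch in s:
        h = (h * BASE + ord(ch)) % MOD
    return h

def build_filter(needles):
    """Pre-processing: fingerprint the first W chars of each needle.
    Needles shorter than W would need separate handling."""
    return {fingerprint(n[:W]) for n in needles if len(n) >= W}

def scan(text: str, fps: set):
    """Slide a rolling hash over `text`, yielding offsets whose window
    fingerprint matches some needle prefix.  Hits may be false
    positives; each gets verified by an exact search afterwards."""
    if len(text) < W:
        return
    top = pow(BASE, W - 1, MOD)   # weight of the outgoing character
    h = fingerprint(text[:W])
    if h in fps:
        yield 0
    for i in range(1, len(text) - W + 1):
        h = ((h - ord(text[i - 1]) * top) * BASE
             + ord(text[i + W - 1])) % MOD
        if h in fps:
            yield i
```

Each candidate offset would then be checked with the plain string search mentioned above, and only needles sharing the hit fingerprint need to be compared.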

There is significant structure to the strings (they happen to be source code -- snippets on the needle side and files in the haystack).
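
Because everything is source code, I assume some normalization before fingerprinting would pay off, so that reformatted copies of a snippet still match. Even something as simple as collapsing whitespace (exact rules TBD, and per-language tokenization might be better):

```python
import re

def normalize(code: str) -> str:
    # Collapse all whitespace runs to a single space so that
    # reformatted copies of a snippet produce the same fingerprints.
    # (Placeholder rule; real normalization might also strip comments
    # or tokenize per language.)
    return re.sub(r"\s+", " ", code).strip()
```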

Appreciate any thoughts on where to explore.

No correct solution
