Approximating unique string occurrences using a small, fixed amount of memory

https://cs.stackexchange.com/questions/99771

05-11-2019

Question

Is there a small-memory way to estimate how many unique strings have been encountered, without having to store the strings themselves? The catch is that we only have a tiny amount of memory to track which strings we have seen: tiny as in a 64-bit integer or a 64-byte string. (The easy solution, given unlimited memory, is to keep a hash table of the strings.) The strings being tracked can themselves be very long.

To make sure "unique" is clear, let's say we receive four strings: "cat", "dog", "cow" and "cat". We would count that as three unique strings as "cat" appears twice.
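For reference, the unlimited-memory baseline mentioned above can be sketched in a few lines of Python. This is only an illustration of what "unique" means here; the function name `exact_unique_count` is made up for this example, and its memory use grows with the number of distinct strings, which is exactly what the question wants to avoid.

```python
def exact_unique_count(strings):
    """Exact count of distinct strings; memory grows with the number of distinct strings."""
    seen = set()
    for s in strings:
        seen.add(s)  # adding a string that is already present has no effect
    return len(seen)


print(exact_unique_count(["cat", "dog", "cow", "cat"]))  # 3
```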

This method can be lossy! A hashing solution with collisions is perfectly acceptable. Ideally the solution is more accurate for small counts: e.g. when counting 2 unique strings we can be pretty sure the answer is exactly 2, but if the method returns 200, it only needs to be a decent estimate.

The partial solution I have been considering is:

Set the tracker value to 1
Set the counter value to 0
For each string, compute its hash value. If the tracker is not evenly divisible by the hash, increment the counter and multiply the tracker by the hash.

The problem is that the tracker will quickly overflow, and that will then (I assume) change which numbers evenly divide it.
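The tracker/counter steps above, and the overflow concern, can be tried out with a small Python sketch. Everything named here is an assumption made for illustration: `count_unique` and its `width_bits` parameter are hypothetical, and `string_hash` is just one possible way to turn a string into an integer of at least 2. With `width_bits=None` the tracker is an unbounded Python integer (so it never overflows, but it also is not fixed-size); otherwise it is truncated to the given number of bits after every multiplication, which mimics fixed-width overflow.

```python
import hashlib


def string_hash(s: str) -> int:
    """Hypothetical helper: map a string to an integer >= 2 via SHA-256."""
    digest = hashlib.sha256(s.encode("utf-8")).digest()
    # Use 4 bytes of the digest; add 2 so the hash is never 0 or 1.
    return int.from_bytes(digest[:4], "big") + 2


def count_unique(hashes, width_bits=None):
    """Tracker/counter scheme described in the question.

    width_bits=None keeps the tracker as an unbounded integer;
    otherwise the tracker is truncated to width_bits after each multiply.
    """
    mask = None if width_bits is None else (1 << width_bits) - 1
    tracker = 1
    counter = 0
    for h in hashes:
        if tracker % h != 0:  # hash does not divide the tracker: treat as new
            counter += 1
            tracker *= h
            if mask is not None:
                tracker &= mask  # simulate fixed-width overflow
    return counter


hashes = [3, 5, 7, 11, 3]  # hand-picked hash values; 3 appears twice
print(count_unique(hashes))                # 4
print(count_unique(hashes, width_bits=8))  # 5
print(count_unique(map(string_hash, ["cat", "dog", "cow", "cat"])))
```

In the 8-bit run the product 3 * 5 * 7 * 11 = 1155 wraps to 131, which is no longer a multiple of 3, so the repeated 3 is counted again. Even without overflow the scheme can undercount when hashes are not pairwise coprime: for the hash values 2, 3, 6 it returns 2, because 6 already divides 2 * 3.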

No correct solution

Licensed under: CC-BY-SA with attribution

Not affiliated with cs.stackexchange