get unique words in text stream

https://stackoverflow.com/questions/23644241

22-07-2023
|

سؤال

At a given instance can we find out unique words in a text stream. One naive solution i can think of is using a hashmap to keep words count.

But this would require keeping words which have word count more than 1 in hashmap. In case of long text stream, it will be lot of words to maintain. Is there a way to crunch on space complexity for this.

المحلول

You cannot get the number of distinct words exactly without paying the space complexity. However, you can get a reasonably good estimate by using the Flajolet-Martin approach, described on slide 20 of this slide deck.

Assuming the data stream consists of a universe of elements chosen from a set of size N, you can do the following steps, copied from the slides linked above.

Pick a hash function h that maps each of the N elements to at least log_2 (N) bits.
For each stream element a, let r(a) be the number of trailing 0's in h(a).
Record R = the maximum r(a) seen.
Estimated number of distinct elements = 2^R.

مرخصة بموجب: CC-BY-SA مع الإسناد

لا تنتمي إلى StackOverflow