سؤال

At a given instance can we find out unique words in a text stream. One naive solution i can think of is using a hashmap to keep words count.

But this would require keeping words which have word count more than 1 in hashmap. In case of long text stream, it will be lot of words to maintain. Is there a way to crunch on space complexity for this.

هل كانت مفيدة؟

المحلول

You cannot get the number of distinct words exactly without paying the space complexity. However, you can get a reasonably good estimate by using the Flajolet-Martin approach, described on slide 20 of this slide deck.

Assuming the data stream consists of a universe of elements chosen from a set of size N, you can do the following steps, copied from the slides linked above.

  1. Pick a hash function h that maps each of the N elements to at least log_2 (N) bits.
  2. For each stream element a, let r(a) be the number of trailing 0's in h(a).
  3. Record R = the maximum r(a) seen.

  4. Estimated number of distinct elements = 2^R.

مرخصة بموجب: CC-BY-SA مع الإسناد
لا تنتمي إلى StackOverflow
scroll top