You cannot get the number of distinct words exactly without paying the space complexity. However, you can get a reasonably good estimate by using the Flajolet-Martin approach, described on slide 20 of this slide deck.
Assuming the data stream consists of a universe of elements chosen from a set of size N
, you can do the following steps, copied from the slides linked above.
- Pick a hash function
h
that maps each of theN
elements to at leastlog_2 (N)
bits. - For each stream element
a
, letr(a)
be the number of trailing0
's inh(a)
. Record R = the maximum
r(a)
seen.Estimated number of distinct elements =
2^R
.