Question

How do I modify the procedure in Ukkonen's paper to hold a value for number of times a word appears in the text. Are there any such implementations available that provide the string frequency as well?

The modification I want is like for a string "hehe" the frequency count for all "h","e","he" should be 2 in the tree. Rest nodes have the default value 1.

I found some libraries like the best so far and some previous questions like this.

But none of them describes a good enough solution to my problem. Also I have to process a very large dictionary file (about a billion words). Then the algorithm needs to be really fast. And I am ready to compromize on space a bit.

Was it helpful?

Solution

The answer can be found here: Counting the number of substrings

Basically, build the suffix tree, match the substring starting from the root and count the leaf nodes below that point. That's the number of times the word appears in the text.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top