Question

I need to count how often each word occurs in a text, and the counts need to be arranged in descending order. I initially used a hash map, but as I researched further I discovered the trie data structure.

I think a trie is a good fit for tracking occurrences in terms of both flexibility and complexity. However, there is one more requirement: I need to sort the words by occurrence count in descending order. So I would basically be traversing the trie with a depth-first search.

Implementation-wise this is a little tricky, so I was wondering whether I am on the right track. Any opinion would be great. What would be the best data structure to use in this case?

Note: The sort order is descending by occurrence count, so if "A" appeared 5 times and "B" appeared 2 times, the order should be "A", "B". Two words with the same number of occurrences should be sorted alphabetically.

Thanks


Solution

If many of the words share prefixes, a trie will be the most memory-efficient solution, although it still needs O(N) memory in the worst case. You'll need to extend the standard trie class with additional information: a counter for each word.
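To make the "trie plus counters" idea concrete, here is a minimal sketch in Python. The names `TrieNode`, `CountingTrie` and `words_by_frequency` are made up for illustration, and words are assumed to be whitespace-separated. Note that the depth-first traversal only collects the (word, count) pairs; the frequency ordering still needs a separate sort.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # letter -> TrieNode
        self.count = 0      # how many times a word ends at this node


class CountingTrie:
    def __init__(self):
        self.root = TrieNode()

    def add(self, word):
        """Insert a word, creating nodes as needed, and bump its counter."""
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.count += 1

    def items(self):
        """Collect (word, count) pairs with an iterative depth-first traversal."""
        result = []
        stack = [(self.root, "")]
        while stack:
            node, prefix = stack.pop()
            if node.count:
                result.append((prefix, node.count))
            for ch, child in node.children.items():
                stack.append((child, prefix + ch))
        return result


def words_by_frequency(text):
    trie = CountingTrie()
    for word in text.split():
        trie.add(word)
    # The trie itself is not ordered by frequency, so sort the collected pairs:
    # descending count first, then alphabetically for ties.
    return sorted(trie.items(), key=lambda pair: (-pair[1], pair[0]))
```

For example, `words_by_frequency("b a a c b a")` puts "a" first, then "b", then "c".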

If you're looking for a solution that is optimal in the worst case, a multimap is a better choice:

  • O(1) insert time, assuming a hash-based implementation (a trie does not give you this if the alphabet has many letters)

  • O(N) memory and running time

You'll still need to sort the words within each occurrence-count bucket. If many words share the same occurrence count, sorting becomes the dominant operation, and the trie approach costs the same as the multimap approach.
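A rough sketch of the bucket idea, assuming a plain hash-based counter (`collections.Counter`) stands in for the multimap; the function name `words_by_count_buckets` is hypothetical. Words are grouped by occurrence count, and only the words inside each bucket are sorted alphabetically.

```python
from collections import Counter, defaultdict


def words_by_count_buckets(words):
    counts = Counter(words)  # word -> number of occurrences

    # Group words into buckets keyed by their occurrence count.
    buckets = defaultdict(list)
    for word, n in counts.items():
        buckets[n].append(word)

    # Walk the buckets from highest to lowest count and sort each bucket
    # alphabetically to break ties.
    ordered = []
    for n in sorted(buckets, reverse=True):
        for word in sorted(buckets[n]):
            ordered.append((word, n))
    return ordered
```

If almost every word has the same count, nearly all the work ends up in the per-bucket `sorted` call, which is the case described above.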

Other tips

The main property of a trie is that it merges shared parts of the incoming data to save space, so if you need a property that is specific to each individual data unit, the trie's built-in properties do not help you. In other words, use a trie if you want to save space, but to get the most frequent words you need some additional step (for example, traversing the trie once the data has been collected and building another table).

My idea: a priority queue keyed on word frequency is a possible candidate.
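One way this could look, as a sketch only: count the words with a hash map, then push (negated count, word) pairs into a binary heap from Python's `heapq` module and pop the k best. The function name `top_k_words` and the parameter `k` are assumptions for illustration. Negating the count turns the min-heap into a "highest count first" queue, and the word as the second tuple element gives alphabetical order for ties.

```python
import heapq
from collections import Counter


def top_k_words(words, k):
    counts = Counter(words)

    # heapq is a min-heap, so store the negated count to pop the most
    # frequent word first; ties fall back to comparing the words themselves.
    heap = [(-n, word) for word, n in counts.items()]
    heapq.heapify(heap)

    result = []
    for _ in range(min(k, len(heap))):
        neg_n, word = heapq.heappop(heap)
        result.append((word, -neg_n))
    return result
```

Building the heap is linear in the number of distinct words, and each pop is logarithmic, so this avoids a full sort when only the first few entries are needed.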

You can use a ternary trie. Insertion is expensive, but you can skip the sort when you are only interested in, say, the 5 most frequent words.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow