Вопрос

I have a source of strings (let us say, a text file) and many strings repeat multiple times. I need to get the top X most common strings in the order of decreasing number of occurrences.

The idea that came to mind first was to create a sortable Bag (something like org.apache.commons.collections.bag.TreeBag) and supply a comparator that will sort the entries in the order I need. However, I cannot figure out what is the type of objects I need to compare. It should be some kind of an internal map that combines my object (String) and the number of occurrences, generated internally by TreeBag. Is this possible?

Or would I be better off by simply using a hashmap and sort it by value as described in, for example, Java sort HashMap by value

Это было полезно?

Решение

Why don't you put the strings in a map. Map of string to number of times they appear in text. In step 2, traverse the items in the map and keep on adding them to a minimum heap of size X. Always extract min first if the heap is full before inserting.
Takes nlogx time.

Otherwise after step 1 sort the items by number of occurrences and take first x items. A tree map would come in helpful here :) (I'd add a link to the javadocs, but I'm in a tablet ) Takes nlogn time.

Другие советы

Лицензировано под: CC-BY-SA с атрибуция
Не связан с StackOverflow
scroll top