質問

I hope that this question is specific enough to be deemed fit for StackOverflow. I checked the FAQ and I think this qualifies, since it is specific and related to programming.

I'm implementing a complex data mining algorithm (FP-growth) in Java. Some of the initial phases of the algorithm require me to scan a large database and keep a running count of each item type found. This seems perfectly suited to a Hashbag interface. I found one in Apache Commons which seems to work for me.

So now, my HashBag is filled with [itemType, count] entries (pairs). Later on in the algorithm, I'm required to do a lot of list-like operations on these pairs. In some cases, I must sort the collection by itemType. In others, I must sort by count. This seems perfectly suited to a List interface.

I'm left with the conclusion that I must convert my Hasbag to a List. Yet it feels dirty somehow, like a waste of space and time. Is there a smarter way to do this, or is it a common situation to have a programming problem where you must treat your collection differently at different times, and conversions are a necessary evil?

One alternative is to make my own interface which is truly a list, but allows "bag-style" adds. I'd have to keep the list sorted and perform binary searches with a custom comparator every time I wanted to add something. Building that collection would probably take longer than building a Hashbag, but I'd save on the conversion step at the end. Any thoughts as to which is preferable?

Thanks!

役に立ちましたか?

解決

If you used Guava's Multiset instead of Apache's Bag -- roughly analogous, but in a different style -- you can do most of this without converting. Multiset.entrySet() returns a Set<Entry<E>>, with Entry<E> effectively representing a pair of an element and a count -- that sounds like it's probably the best way to address your need to operate on the element-count pairs, maybe? You can iterate over that like you'd iterate over a Map.entrySet().

You can use Multisets.copyHighestCountFirst(Multiset) to get a multiset reordered in highest-frequency-first order, and use TreeMultiset to order by the elements directly.

(Disclosure: I contribute to Guava.)

他のヒント

I assume you're using the Apache Commons Collections HashBag class. Have you considered using TreeBag instead? It implements the same Bag interface but efficiently keeps the data sorted according to a comparator you provide.

That said, when you need to change sort order, there isn't usually any better answer than to copy the collection to a new one with a different comparator.

Yet it feels dirty somehow, like a waste of space and time. Is there a smarter way to do this, or is it a common situation to have a programming problem where you must treat your collection differently at different times, and conversions are a necessary evil?

Sometimes it is necessary to convert between collection types. If it is necessary "dirty" or "inelegant" or "dumb" are not really relevant.

It can also be a mistake to over-think these things up front. The actual computational trade-offs are often difficult to grasp. For instance, if you changed the HashBag to a TreeBag, insertion goes from O(1) to O(logN) but you then avoid the overheads of sorting and copying. "Big Oh" analysis / thinking is not going to give you a clear answer. Indeed, the real performance is going to depend on the scaling factors, the values of N, the ratio of hits and misses in the bag and so on.

I would advise to try implementing things the obvious way, and see if it performs well enough ... and if not, profile it to see if the data structures are the main bottleneck. Then based on the profiling, and other measurements of the input datasets, figure out the best way to improve performance from your baseline implementation.

Answering my own question!

I did some experimenting with the different types of Multiset provided by the Guava libary mentioned above by Louis Wasserman. In my particular test case, I'm parsing a 1GB XML file (database of books and authors) and creating a very large Multiset (keeping a count of how many times each author shows up in the DB). Once I reach the end of the parsing, I need to get a new Multiset which only contains the authors who showed up more than x times, where x is some threshold value. I also want my final set to be sorted by author name.

Here are two of the different ways I tried it (among others):

1) collect the original counts in a TreeMultiset and then remove any which don't meet the threshold 2) collect the original counts in a HashMultiset, and then create a new TreeMultiset where I add each item from the hash set with a count the meets the threshold

The second way proved to be significantly faster (roughly 25%), despite the conversion and extra memory usage. Obviously a big part of this is that it is pretty inefficient to delete from binary trees.

So the clear conclusion here is that in this situation, conversion is a good move (unless you have memory constraints that won't allow it).

Thanks again for turning me onto the Guava library, Louis!

ライセンス: CC-BY-SA帰属
所属していません StackOverflow
scroll top