Performance of calculating Java Collection difference or intersection

https://stackoverflow.com/questions/10547335

07-06-2021
|

Question

Java allows the calculation of the (set theoretic) difference and the intersection of two Collection objects, via the removeAll() and retainAll() methods of the Collection interface.

The implementation of these 2 methods in the AbstractCollection class of Java 6 is

public boolean removeAll(Collection<?> c) { // Difference
boolean modified = false;
Iterator<?> e = iterator();
while (e.hasNext()) {
    if (c.contains(e.next())) {
    e.remove();
    modified = true;
    }
}
return modified;
}

public boolean retainAll(Collection<?> c) { // Intersection
boolean modified = false;
Iterator<E> e = iterator();
while (e.hasNext()) {
    if (!c.contains(e.next())) {
    e.remove();
    modified = true;
    }
}
return modified;
}

Is there any way of implementing or executing the above (obviously expensive) operations faster?

For example, would there be any overall performance gain from sorting a Collection before calculating the differences or the intersection?

Is there any class of the Collections framework preferable (performance-wise) for using these operations?

Solution

Yes, there is a faster method possible. The code you supplied loops through c for every element of e. With two arrays of 100 elements, it would compare approximately 100,000 elements.

If you sort both arrays first, you only have to keep comparing the top two elements. This would do a couple hundred comparisons. This would be similar to merge sort. Do to do an intersection of the sorted collections left and right:

function intersect(left, right)
    var list result
    while length(left) > 0 and length(right) > 0
        if first(left) == first(right)
            append first(left) to result
            left = rest(left)
            right = rest(right)
        else if first(left) < first(right)
            left = rest(left)
        else
            right = rest(right)
    end while
    return result

OTHER TIPS

These implementations are in AbstractCollection and therefore they are very generic since at this level of abstraction very little is know about the collections and the number of available operations is very limited. It's hard to anything much smarter given only what the Collection interface allows and not knowing anything about the kind of collection and its implementation details. Sorting may or may not be effective depending on size and type of collection in question, which at this level the code can't know.

Reading the javadoc of AbstractCollection:

To implement an unmodifiable collection, the programmer needs only to extend this class and provide implementations for the iterator[...]

So I believe that you should check how the Iterator is implemented for a specific class, to really understand the performance of those methods.

Is there any way of implementing or executing the above (obviously expensive) operations faster?

How expensive these operations really are depends on how the collection passed as argument implements contains(). If it's a HashSet, contains is a constant (expected) time operation, causing removeAll or retainAll to complete in linear (expected) time.

Sorting would be more expensive that.

And well, it is reasonable that set operations are most efficient when done on a Set, isn't it?

If the elements in the collection are enums or dense integers, you can get more speed with an EnumSet or a BitSet.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow