Question

I'm trying to find the intersection of three hashsets of varying sizes. Is there any difference in the rate at which the intersection can be found by altering the order in which the sets are intersected. An example program would be as follows:

public class RetainTest {
    static Set<Integer> large =new HashSet<>();
    static Set<Integer> medium =new HashSet<>();
    static Set<Integer> small =new HashSet<>();

    static int largeSize=10000;
    static int midSize=5000;
    static int smallSize=1000;      

    public static void main(String[] args){
        preamble()
        large.retainAll(medium);
        large.retainAll(small);

        System.out.println(large.size());
    }


    public static void preamble(){
        large =new HashSet<>();
        medium =new HashSet<>();
        small =new HashSet<>();

        Random rnd=new Random(15);
        for(int i=0;i<largeSize;i++){
            large.add(rnd.nextInt(largeSize*10));
        }

        for(int i=0;i<midSize;i++){
            medium.add(rnd.nextInt(largeSize*10));
        }
        for(int i=0;i<smallSize;i++){
            small.add(rnd.nextInt(largeSize*10));
        }

    }

}
Was it helpful?

Solution 2

Cost of queries on hash sets don't depend on the size of the set. setA.retainAll(setB) is iteration through setA with queries on setB (see implementation of AbstractCollection.retainAll()). Overall cost of this operation linearly depends on the size of setA. Therefore you should always iterate through the least set:

small.retainAll(medium);
small.retainAll(large);

Benchmark by Richard Tingle proven this.
EDIT Ah, Richard Tingle is the author of the question too :)

If you have exactly three sets and performance really matters, try to find intersection during the single iteration:

Iterator<E> it = small.iterator();
while (it.hasNext()) {
    E e = it.next();
    if (!medium.contains(e) || !large.contains(e))
        it.remove();
}

Since Java 8:

small.removeIf(e -> !medium.contains(e) || !large.contains(e));

OTHER TIPS

Profiling suggests that the fastest way to combine several sets is to retainAll the larger sets into the smaller set. Further more the order of those retains should also be smallest to largest. So

    small.retainAll(medium);
    small.retainAll(large);

The profiling suggests the difference is significant: for this data set the slowest order took approximately 10 times as long as the slowest order

enter image description here

Test program

These results are created using the following test program which was left to run for 20 minutes

public class RetainTest {
    
    static Set<Integer> large =new HashSet<>();
    static Set<Integer> medium =new HashSet<>();
    static Set<Integer> small =new HashSet<>();
    
    static int largeSize=10000;
    static int midSize=5000;
    static int smallSize=1000;      
    
    public static void main(String[] args){
        while(true){
            preamble();
            int size1=largeMediumSmall().size();
            preamble();
            int size2=largeSmallMedium().size();
            preamble();
            int size3=smallMediumLarge().size();
            preamble();
            int size4=smallLargeMedium().size();
            preamble();
            int size5=mediumSmallLarge().size();
            preamble();
            int size6=mediumLargeSmall().size();
            
            //sanity check + ensuring the JIT can't optimise out
            if (size1!=size2 || size1!=size3 || size1!=size4 || size1!=size5 || size1!=size6){
                System.out.println("bad");
            }
        }
        

    }
    
    public static Set<Integer> largeMediumSmall(){
        large.retainAll(medium);
        large.retainAll(small);
        
        return large;
    }
    
    public static Set<Integer> smallMediumLarge(){
        small.retainAll(medium);
        small.retainAll(large);
        
        return small;
    }
    public static Set<Integer> smallLargeMedium(){
        small.retainAll(large);
        small.retainAll(medium);
        
        return small;
    }
    public static Set<Integer> mediumSmallLarge(){
        medium.retainAll(small);
        medium.retainAll(large);
        
        return medium;
    }
    public static Set<Integer> mediumLargeSmall(){
        medium.retainAll(large);
        medium.retainAll(small);
        
        return medium;
    }
    public static Set<Integer> largeSmallMedium(){
        large.retainAll(small);
        large.retainAll(medium);
        
        return large;
    }
    
    
    public static void preamble(){
        large =new HashSet<>();
        medium =new HashSet<>();
        small =new HashSet<>();
        
        Random rnd=new Random(15);
        for(int i=0;i<largeSize;i++){
            large.add(rnd.nextInt(largeSize*10));
        }
        
        for(int i=0;i<midSize;i++){
            medium.add(rnd.nextInt(largeSize*10));
        }
        for(int i=0;i<smallSize;i++){
            small.add(rnd.nextInt(largeSize*10));
        }
        
    }
    
}
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top