Why is Set.contains() much faster for Point than for List<Integer> of size 2?

https://stackoverflow.com/questions/23629794

21-07-2023
|

Question

I recently needed to do many-many lookups on a set of integer pairs for solving a problem in a programming contest. I program in Java. Normally I would use arrays for pairs of integers, but because sets in Java do not work with arrays, I decided to use lists this time. I later discovered that my solution was slow, but another contestant's solution using Point instead was much faster.

I implemented a small benchmark to compare the two. I discovered that Set.contains() on a Set<List<Integer>> (where each List<Integer> is of size 2) is on average about 4.5 times slower than on a Set<Point>, regardless of whether the pair is actually contained in the set. I thought the reason could stem from different implementations of hashCode() and additionally implemented a custom Point class that uses List.hashCode(). This appears to be part of the reason, but the difference with an actual List<Integer> is still considerable.

This is my benchmark code:

import java.awt.Point;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

public class IntPairTiming {

    public static final int SIZE = 1_000_000;
    public static final int TRIES = 100_000_000;

    public static void main(String[] args) {

        Random random = new Random();

        // Pairs to store
        Set<Point> pointSet = new HashSet<>();
        Set<List<Integer>> listSet = new HashSet<>();
        Set<MyPoint> myPointSet = new HashSet<>();
        for (int i = 0; i < SIZE; i++) {
            Point point = new Point(random.nextInt(), random.nextInt());
            pointSet.add(point);
            List<Integer> list = Arrays.asList(point.x, point.y);
            listSet.add(list);
            myPointSet.add(new MyPoint(point.x, point.y, list));
        }

        // Pair to lookup
        Point point = new Point(random.nextInt(), random.nextInt());
        List<Integer> list = Arrays.asList(point.x, point.y);
        MyPoint myPoint = new MyPoint(point.x, point.y, list);
        pointSet.add(point);
        listSet.add(list);
        myPointSet.add(myPoint);

        // Time set of points
        timeLookup(pointSet, point, "Set of Points");

        // Time set of lists
        timeLookup(listSet, list, "Set of Lists");

        // Time set of mypoints
        timeLookup(myPointSet, myPoint, "Set of MyPoints");
    }

    private static <T> void timeLookup(Set<T> set, T obj, String what) {
        System.out.println("-----------");
        System.out.printf("%s:%n", what);
        System.out.printf("Contains = %s%n", set.contains(obj));
        long s = System.nanoTime();
        for (int i = 0; i < TRIES; i++) set.contains(obj);
        long e = System.nanoTime();
        System.out.printf("Average lookup = %sns%n", (e - s) / TRIES);
    }

    private static class MyPoint extends Point {

        private final List<Integer> list;

        public MyPoint(int x, int y, List<Integer> list) {
            super(x, y);
            this.list = list;
        }

        @Override
        public int hashCode() {
            return this.list.hashCode();
        }
    }
}

Output:

-----------
Set of Points:
Contains = true
Average lookup = 12ns
-----------
Set of Lists:
Contains = true
Average lookup = 55ns
-----------
Set of MyPoints:
Contains = true
Average lookup = 25ns

Why is there such a noticeable difference? From my understanding, both hashCode() and equals() should perform more or less equivalently on a Point and a List<Integer> of size 2. What if we need to lookup tuples of more than two elements, would a custom class similar to Point always be more efficient than a list?

Solution

This is based on the fact that the expense for equals() on any sort of AbstractList is greater than one would imagine.

The implementation needs to construct an iterator, iterate it, and compare all elements (and hashCode would need to as well), while a dedicated Point class has two comparisons that are extremely easily optimized by the JIT (due to field offsets not changing as much) while a list requires much more indirection.

At a point (no pun intended) where your fields number into the tens it becomes a premature optimization to create a separate tuple class unless that tuple class itself has some use as a POJO in itself.

OTHER TIPS

Point contains primitive fields, whereas List<Integer> contains objects, although very simple.

The hash code calculation for Point will use these fields directly, while the hash code calculation for List<Integer> will, loosely speaking, require two more deference operations to get to the actual values.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow