Problem

In the code below, I needed to fetch an element, any element, from toSearch. I was unable to find a useful method on the Set interface definition to return just a single (random, but not required to be random) member of the set. So, I used the toArray()[0] technique (present in the code below).

private Set<Coordinate> floodFill(Value value, Coordinate coordinateStart)
{
    Set<Coordinate> result = new LinkedHashSet<Coordinate>();

    Set<Coordinate> toSearch = new LinkedHashSet<Coordinate>();
    toSearch.add(coordinateStart);
    while (toSearch.size() > 0)
    {
        Coordinate coordinate = (Coordinate)toSearch.toArray()[0];
        result.add(coordinate);
        toSearch.remove(coordinate);
        for (Coordinate coordinateAdjacent: getAdjacentCoordinates(coordinate))
        {
            if (this.query.getCoordinateValue(coordinateAdjacent) == value)
            {
                if (!result.contains(coordinateAdjacent))
                {
                    toSearch.add(coordinateAdjacent);
                }
            }
        }
    }

    return result;
}

The other technique I have seen discussed is to replace "(Coordinate)toSearch.toArray()[0]" with "toSearch.iterator().next()". Which technique, toArray() or iterator(), is likely to execute faster and with the least GC (Garbage Collection) impact?

My intuition (after composing this question) is that the second technique, using the Iterator, will both execute faster and impose lower overhead on the GC. Given that I don't know the implementation of the Set being passed (assuming HashSet or LinkedHashSet as most likely), how much overhead is incurred by each of the toArray() and iterator() methods? Any insights on this would be greatly appreciated.

Questions (repeated from above):

  1. Which technique, toArray() or iterator(), is likely to execute faster and with the least GC (Garbage Collection) impact?
  2. Given I don't know the implementation of the Set being passed (assuming HashSet or LinkedHashSet as most likely), how much overhead is incurred in each of the toArray() and iterator() methods?

Solution

toSearch.iterator().next() will be faster and less memory-intensive, because it does not need to copy any data: it allocates only a single small iterator object, whereas toArray() must allocate an Object[] sized to the whole set and copy every element into it. This holds irrespective of the actual implementation: toArray() always has to copy the data.
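To make the difference concrete, here is a minimal sketch contrasting the two techniques (the helper names `viaToArray`/`viaIterator` are mine, for illustration only):

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class FirstElementDemo {

    // toArray() allocates and fills an Object[n] just to read one element.
    static <T> T viaToArray(Set<T> set) {
        @SuppressWarnings("unchecked")
        T first = (T) set.toArray()[0];
        return first;
    }

    // iterator() creates one small iterator object; nothing is copied.
    static <T> T viaIterator(Set<T> set) {
        return set.iterator().next();
    }

    public static void main(String[] args) {
        Set<String> set = new LinkedHashSet<String>();
        set.add("a");
        set.add("b");
        // LinkedHashSet iterates in insertion order, so both return "a"
        System.out.println(viaToArray(set));  // a
        System.out.println(viaIterator(set)); // a
    }
}
```

Both return the same element here because LinkedHashSet guarantees insertion order; with a plain HashSet the element returned is arbitrary (but the same for both calls on an unmodified set).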

Other tips

From what I can see, you are doing a breadth-first search (BFS).

Below is an example of how it could be implemented without using toArray():

private Set<Coordinate> floodFill(Value value, Coordinate coordinateStart) {
    final Set<Coordinate> visitedCoordinates = new LinkedHashSet<Coordinate>();
    final Deque<Coordinate> deque = new ArrayDeque<Coordinate>();

    deque.add(coordinateStart);

    while (!deque.isEmpty()) {
        final Coordinate currentVertex = deque.poll();
        visitedCoordinates.add(currentVertex);
        for (Coordinate coordinateAdjacent : getAdjacentCoordinates(currentVertex)) {
            if (this.query.getCoordinateValue(coordinateAdjacent) == value) {
                if (!visitedCoordinates.contains(coordinateAdjacent)) {
                    deque.add(coordinateAdjacent);
                }
            }
        }
    }

    return visitedCoordinates;
}

Implementation notes:

"And now I am concerned that the contains() method's implementation on LinkedList could be doing up to a full scan of the contents before returning the answer."

You are right about the full scan (a.k.a. linear search). Nevertheless, in your case it is possible to keep an additional set for tracking already-visited vertices (which, by the way, is exactly your result set!); that would resolve the issue with the contains() method via an O(1) lookup.

Cheers

Here's how I'd implement this:

private Set<Coordinate> floodFill(Value value, Coordinate start) {
    Set<Coordinate> result = new LinkedHashSet<Coordinate>();
    LinkedList<Coordinate> toSearch = new LinkedList<Coordinate>();
    toSearch.add(start);
    do {
        Coordinate coordinate = toSearch.removeFirst();
        if (result.add(coordinate)) {
            for (Coordinate adjacent: getAdjacentCoordinates(coordinate)) {
                if (this.query.getCoordinateValue(adjacent) == value) {
                    toSearch.add(adjacent);
                }
            }
        }
    } while (!toSearch.isEmpty());
    return result;
}

Notes:

  1. If you think about it, the toSearch data structure doesn't need to contain unique elements.
  2. Using a LinkedList for toSearch means that there is a simple method to get an element and remove it in one go.
  3. We can use the fact that Set.add(...) returns a boolean to halve the number of lookups on the result set, compared with calling Set.contains() followed by Set.add().
  4. It would be better to use HashSet rather than LinkedHashSet for the results ... unless you need to know the order in which coordinates were added by the fill.
  5. Using == to compare Value instances is potentially a bit dodgy.
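Point 3 above hinges on the contract of Set.add(): it returns true only when the element was not already present, so a single call both tests membership and records the element. A minimal sketch:

```java
import java.util.HashSet;
import java.util.Set;

public class AddReturnDemo {
    public static void main(String[] args) {
        Set<Integer> visited = new HashSet<Integer>();
        // One add() call both checks membership and records the element.
        System.out.println(visited.add(7)); // true  (was absent, now recorded)
        System.out.println(visited.add(7)); // false (already present)
    }
}
```

This is why `if (result.add(coordinate)) { ... }` in the code above needs no separate contains() check.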

After Petro's response, I copied the method and reimplemented it according to his suggestions. It looks like this:

private Set<Coordinate> floodFind2(Value value, Coordinate coordinateStart)
{
    Set<Coordinate> result = new LinkedHashSet<Coordinate>();

    Queue<Coordinate> toSearch = new LinkedList<Coordinate>();
    toSearch.add(coordinateStart);
    while (!toSearch.isEmpty())
    {
        Coordinate coordinate = toSearch.remove();
        result.add(coordinate);
        for (Coordinate coordinateAdjacent: getAdjacentCoordinates(coordinate))
        {
            if (getCoordinateValue(coordinateAdjacent).equals(value))
            {
                if (!result.contains(coordinateAdjacent))
                {
                    if (!toSearch.contains(coordinateAdjacent))
                    {
                        toSearch.add(coordinateAdjacent);
                    }
                }
            }
        }
    }

    return result;
}

By moving from Set to Queue, my efficiency questions shifted to the new conditional check I had to add: "if (!toSearch.contains(coordinateAdjacent))". The Set interface silently stopped me from adding duplicates; with the Queue interface, I have to check explicitly that I'm not adding a duplicate.

And now I am concerned that the contains() method's implementation on LinkedList could be doing up to a full scan of the contents before returning the answer. So, comparing this method to the one I originally posted, which is likely to be more efficient (before I go spend a good quantity of time doing the empirical testing)?

Okay, below is my latest implementation, incorporating feedback (mainly from Stephen, Cameron and Petro), which completely eliminates the toArray()[0]-vs-iterator().next() conflict. I have sprinkled in comments to more accurately distinguish what's occurring and why, and to better clarify why I concretely implemented Petro's original "use a tracking Set" advice (seconded by Cameron). Just after the code snippet, I will contrast it with the other proposed solutions.

private Set<Coordinate> floodFind3(Coordinate coordinate)
{
    Set<Coordinate> area = new LinkedHashSet<Coordinate>(); //includes only area of value (which is the same as at coordinate)

    area.add(coordinate);
    Value value = getCoordinateValue(coordinate); //value upon which to expand area
    Set<Coordinate> checked = new LinkedHashSet<Coordinate>(); //every coordinate evaluated regardless of value
    checked.add(coordinate);
    Queue<Coordinate> candidates = new LinkedList<Coordinate>(); //coordinates evaluated, were of value, and are queued to iterate through their adjacents
    candidates.add(coordinate);
    while (!candidates.isEmpty())
    {
        for (Coordinate coordinateAdjacent: this.query.getNordinates().getAdjacent(candidates.remove()).getOrthogonal())
        {
            if (checked.add(coordinateAdjacent)) //only expands containing value and !value
            {
                if (getCoordinateValue(coordinateAdjacent) == value)
                {
                    area.add(coordinateAdjacent); //only expands containing value
                    candidates.add(coordinateAdjacent); //expands and contracts containing value
                }
            }
        }
    }

    return area;
}

I have updated the method in several significant ways:

  1. One less method parameter: I removed a parameter as it was derivable from the search and eliminated a possible logical issue where the starting coordinate is pointing at a location containing !value.
  2. Three collections track the search: area (Set), checked (Set) and candidates (Queue). The code comments clarify the specific use of each. I used LinkedHashSet for reliable reproducibility while chasing bugs and performance issues (http://stackoverflow.com/questions/2704597/iteration-order-of-hashset). Once stable, I will likely revert to the faster HashSet implementation.
  3. Reordered the "check if already evaluated" test to come before the "is value" test, so that every coordinate is visited exactly once. This avoids revisiting !value adjacent coordinates more than once. Also incorporated Stephen's clever double use of the Set add() method. This becomes very important as the area to flood becomes more maze-like (snaky/spidery).
  4. Kept the "==" for checking value forcing a reference comparison. Value is defined to be a Java 1.5 Enum and I didn't want to depend upon HotSpot to both inline the .equals() method call and reduce it to a reference comparison. If Value were ever to change from being an Enum, this choice could come back to bite me. Tyvm to Stephen for pointing this out.
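On point 4: for enum constants, == and equals() always agree, because the JVM guarantees each constant is a singleton; == additionally avoids a NullPointerException when the left operand is null. A minimal sketch (this Value enum is a hypothetical stand-in for the one in the post):

```java
public class EnumCompareDemo {
    // Hypothetical stand-in for the post's Value enum.
    enum Value { EMPTY, FILLED }

    public static void main(String[] args) {
        Value a = Value.FILLED;
        Value b = Value.FILLED;
        // Enum constants are singletons, so reference and logical equality coincide.
        System.out.println(a == b);           // true
        System.out.println(a.equals(b));      // true

        Value c = null;
        System.out.println(c == Value.EMPTY); // false, no NPE; c.equals(...) would throw
    }
}
```

So the "come back to bite me" risk only materializes if Value stops being an enum (e.g. becomes a class with multiple equal instances); until then, == is both safe and fast.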

Petro's and Stephen's solutions visit the coordinates containing value just once, but must revisit the coordinates containing !value more than once, which can result in quite a few duplicate fetches/value checks for areas consisting of long maze-like tunnels. While "long maze-like tunnels" may be considered a pathological case, it is more typical of the particular domain for which I need this method. And my "2nd" attempted solution (which had the poorly performing LinkedList contains() call) was questionable as a real answer ({nod} to Stephen on that one).

Thank you for all your feedback.

Next up, lots of empirical testing with single variations/changes over hundreds of millions of invocations. I'll update this answer with details sometime this weekend.

License: CC-BY-SA with attribution. Not affiliated with StackOverflow.