How should I define a good hashCode for a circular linked list in Java?

https://stackoverflow.com/questions/3747188

04-10-2019
|

Question

I have set up a circular linked list data structure that represents a word, and each element in the list is a letter from the word. At the bottom of my question are the class definitions of the list and element of the list.

The purpose of the list data structure is to be able to compare cyclic words. So... "picture" and "turepic" are the same cyclic word, so the two lists will be equal.

So I override equals() when comparing two lists, and I've read that whenever you have to override equals(), you have to also override hashCode(). However, I don't really have a good idea of how to do that.

How should I define a good hashCode for what I have set up? What things should I consider? In the example of the "picture" and "turepic", the two lists are equal so their hashCode needs to be the same. Any ideas?

Thanks, Hristo

public class Letter {
 char value;
 Letter theNextNode;

 /**
  * Default constructor for an element of the list.
  * 
  * @param theCharacter - the value for this node.
  */
 Letter(char theCharacter) {
  this.value = theCharacter;
 }
}


public class CircularWord {

 /*
  * Class Variables
  */
 Letter head;
 Letter tail;
 Letter theCurrentNode;

 int iNumberOfElements;


 /**
  * Default Constructor. All characters that make up 'theWord' are stored in a 
  * circular linked list structure where the tail's NEXT is the head. 
  */
 public CircularWord(String theWord) {

  char[] theCharacters = theWord.toCharArray();

  for (int iIndex = 0; iIndex < theCharacters.length; iIndex++) {
   this.addElement(theCharacters[iIndex]);
  }

  this.theCurrentNode = head;
  this.iNumberOfElements = theCharacters.length;
 }
}

Solution

How about the sum of the hashcodes of all the elements inside your list, each multiplied by an arbitrary value?

Something like

hashCode = 1;
for (char c : myChars) {
    hashCode += 31 * c;
}

OTHER TIPS

So you want a hashcode calculation which gives equal results for "picture" and "turepic", but (preferably) different from the hashcode of e.g. "eruptic". Thus it is not enough to simply add up the hashcodes of the letters contained in the word - you need to have some position information too, but still, it should be independent of the actual permutation of the word. You need to define "equivalence classes", and always calculate the same hashcode for each member of the class.

The easiest way to achieve this is to select a specific member of the equivalence class and always use the hashcode of that variation for all equivalent words. E.g. select the first variant alphabetically (thanks @Michael for summing it up concisely). For "picture" et al., that would be "cturepi". Both "picture" and "turepic" (and all the other equivalent variations) should return the hash code of "cturepi". That hash code could be calculated by the standard LinkedList method, or any other preferred way.

One might say that this calculation is very expensive. True, however one could cache the result, so that only the first calculation would be costly. And I guess the selection of the first alphabetical variant could be optimized fairly much in the common case (compared to the trivial solution of generating all permutations in the specific equivalence class, then sorting them and picking the first).

E.g. in many of the words, the first letter alphabetically is unique ("picture" is one of these - its first letter alphabetically is 'c', and there is only one 'c' in it). So you only need to find it, then calculate the hashcode starting from there. If it is not unique, you need to compare the second, third, etc. letters after that, until you find a difference (or you roll over).

Update 2 - examples

"abracadabra" contains 5 'a's. The 2nd characters after the 'a's are 'b', 'c', 'd', 'b' and 'a', respectively. So in the 2nd round of comparison you can conclude that the lexicographically smallest variation is "aabracadabr".
"abab" contains 2 'a's, and a 'b' after each (and then you roll over, reaching an 'a' again, so the quest ends there). So you have two identical lexicographically smallest variations of it. But since they are identical, they obviously produce the same hashcode.

Update: In the end, it all boils down to how much do you actually need the hashcode - i.e. do you plan to put your circular lists into an associative collection like Set or Map. If not, you can do with a simple, or even trivial hash method. But if you use some associative collection heavily, a trivial hash implementation gives you lots of collisions thus suboptimal performance. In this case, it is worth a try implementing this hash method and measuring whether it pays for itself in performance.

Update 3: sample code

Letter is basically left the same as above, I only made the fields private, renamed theNextNode to next, and added getters/setters as needed.

In CircularWord I made some more changes: dropped tail and theCurrentNode, and made the word really circular (i.e. last.next == head). The constructor, toString and equals are not relevant for computing the hashcode, so they are omitted for the sake of simplicity.

public class CircularWord {
    private final Letter head;
    private final int numberOfElements;

    // constructor, toString(), equals() omitted

    @Override
    public int hashCode() {
        return hashCodeStartingFrom(getStartOfSmallestRotation());
    }

    private Letter getStartOfSmallestRotation() {
        if (head == null) {
            return null;
        }
        Set<Letter> candidates = allLetters();
        int counter = numberOfElements;

        while (candidates.size() > 1 && counter > 0) {
            candidates = selectSmallestSuccessors(candidates);
            counter--;
        }
        return rollOverToStart(counter, candidates.iterator().next());
    }

    private Set<Letter> allLetters() {
        Set<Letter> letters = new LinkedHashSet<Letter>();
        Letter letter = head;

        for (int i = 0; i < numberOfElements; i++) {
            letters.add(letter);
            letter = letter.getNext();
        }
        return letters;
    }

    private Set<Letter> selectSmallestSuccessors(Set<Letter> candidates) {
        Set<Letter> smallestSuccessors = new LinkedHashSet<Letter>();

        char min = Character.MAX_VALUE;
        for (Letter letter : candidates) {
            Letter nextLetter = letter.getNext();
            if (nextLetter.getValue() < min) {
                min = nextLetter.getValue();
                smallestSuccessors.clear();
            }
            if (nextLetter.getValue() == min) {
                smallestSuccessors.add(nextLetter);
            }
        }
        return smallestSuccessors;
    }

    private Letter rollOverToStart(int counter, Letter lastCandidate) {
        for (; counter >= 0; counter--) {
            lastCandidate = lastCandidate.getNext();
        }
        return lastCandidate;
    }

    private int hashCodeStartingFrom(Letter startFrom) {
        int hash = 0;
        Letter letter = startFrom;
        for (int i = 0; i < numberOfElements; i++) {
            hash = 31 * hash + letter.getValue();
            letter = letter.getNext();
        }
        return hash;
    }

}

The algorithm implemented in getStartOfSmallestRotation to find the lexicographically smallest rotation of the word is basically what I describe above: compare and select the lexicographically smallest 1st, 2nd, 3rd etc. letters of each rotation, dropping the greater letters until either there is only one candidate left, or you roll over the word. Since the list is circular, I use a counter to avoid an infinite loop.

In the end, if I have a single candidate left, it may be in the middle of the word and I need to get the start of the smallest word rotation. However, as this is a singly-linked list, it is awkward to step backwards in it. Luckily, the counter nicely helps me out: it has recorded how many letters I have compared so far, but in a circular list this is equivalent to how many letters I can move forward before rolling over. Thus I know how many letters to move forward in order to get again to the beginning of the minimal word rotation I am looking for.

Hope this helps someone - at least it was fun to write :-)

Do you actually need to use your hashCodes? If you don't intend to place the object members in any kind of hash structure, you could just ignore the problem:

public int hashCode() {
    return 5;
}

this satisfies the requirements that equal instances have equal hash codes. Unless i knew I needed a better hash distribution, this would probably work well enough for my own needs.

But I think I might have an idea that gives better distribution of hashes. psuedo code:

hash = 0
for each rotation
    hash += hash(permutation)
end
hash %= MAX_HASH

Since hash() is likely to be O(n), then this algorithm is O(n^2), which is a bit slow, but hashes reflect the method used for equivalence testing, the distribution of hash codes is probably pretty decent. any other kernel (prod, xor) that is commutative will work as well as the sum used in this example.

int hashcode() {
    int hash = 0;
    for (c in list) {
        hash += c * c;
    }
    return hash;
}

Since + is commutative, equal words will have equal hashcodes. The hashcode is not very discriminating (all letter permutations get the same hash code), but it should do the trick unless you usually put many permutations into the HashSet.

Note: I add c * c rather than simply c in order to get less collisions for distinct letters.

Note 2: Unequal lists with equal hash codes do not violate of the contract for hashcode. Such "collisions" should be avoided because they reduce performance, but they do not threaten the correctness of the program. In general, collisions can not be avoided, though it is certainly possible to avoid them more than in my answer, but doing so makes the hashcode more expensive to compute, which might more than eat any performance gain.

define equals() and hashCode() for Letter. Do this using only the char field.
For CircularWord, implement hashCode() by iterating from head to tail XOR'ing the respective values of Letter.hashCode. Finally XOR the result with some constant.

Another way would be to canonicalize the cicular words, representing them as something like:

public class CircularWord {

    private static Set<String> canonicalWords = new HashSet<String>();
    private String canonicalWord;
    private int offset;

    public CircularWord(String word) {
        // Looks for an equal cirular word in the set (according to our definition)
        // If found, set canonicalWord to it and calculate the offset.
        // If not found, put the word in the set, set canonical word to our argument and set offset to 0.
    }
    // Implementation of CircularWord methods using
    // canonicalWord and offset
}

You would then implement equals() and hashCode() by delegating to the String implementations.

I misread your question - I thought you wanted different haschodes for "picture" and "turepic"; I think in this case, you can get a hint from the fact that two objects that are equal must have the same hash code, but two objects that have the same hash code may not necessarily be equal.

So you can use Vivien's solution which will guarantee that "picture" and "turepic" will have the same hash code. However, it also means that "picture" and "pitcure" would have the same hash codes as well. In this case, your equals method will have to be smarter and will have to figure out if the two list of letters actually represent the same word. Essentially your equals method helps resolve the collision that you can get from "picture"/"turepic" and "pitcure".

Bear in mind that hashcodes are not unique. Two different objects can hash to exactly the same value. Thus hashcode is insufficient for determining equality; you have to do the actual comparison in equals(). [SPECULATIVE COMMENT REMOVED. OMG]

hashcode() can just return a constant in all cases. This may affect performance but it's totally valid. Once you get everything else done, you can work on a more efficient hashcode() algorithm.

This is a good article. Note the 'lazy hashcode' section.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow