Help designing a hash function to detect duplicate records?

https://stackoverflow.com/questions/4338200

30-09-2019
|

Question

Let me explain my program thus far. It is a rubiks cube solver. I am given a scrambled cube (this is the initial state). This becomes the root node of a graph. I am using iterative deepening depth first search to "brute force" this scrambled cube to a recognizable state which I can then use pattern recognition to solve.

As you can imagine, this is a very large graph, so I would like to come up with some sort of hashing functionality to detect duplicate nodes in this graph (thus speeding up the traversal).

I am largely unfamiliar with hashing functions, but here is what I am thinking... Each node is essentially a different state of the rubik's cube. So if I come to a cube state (node) that has already be seen, I want to skip over it. So I need a hashing function that takes me from the state variable to a checksum, where the state variable is a 54-character string. The only allowed characters are y, r, g, o, b, w (which correspond to colors).

Any help designing this hash function would be greatly appreciated.

Solution

For the fastest duplicate detection and removal - avoid generating many of the repeated positions in the first place. This is easy to do and quicker than generating and then finding the repeats. So for example if you have moves like F and B, if you allow the sub sequence FB don't also allow BF, which gives the same result. If you've just done 3F, don't follow it with F. You can generate a small look-up table for allowed next moves, given the last three moves.

For the remaining duplicates you want a fast hash because there are a lot of positions. To make your hash go fast, as others have commented, you want what it hashes from, the representation of the position, to be small. There are 12 edge cubies and there are 8 corner cubies. Representing each cubies position and orientation need take only five bits per cubie, i.e. 100 bits (12.5 bytes) total. For edges its four bits for position and one for flip. For corners its three bits for position and 2 for spin. You can ignore the last edge cubie since its position and flip is fixed by the others. With this representation you are already down to 12 bytes for the position.

You have about 70 real bits of information in a rubik cube position, and 96 bits is close enough to 70 to make it actually counter productive hashing those bits further. I.e. treat this representation of the board as your hash. That may sound a bit strange, but from your question I'm envisaging you at the same time experimenting with a less compact representation of the cube that is more amenable to your pattern matching. In that case the 12 byte value can be treated as if it were a hash, with the advantage that it's a hash that never has a collision. That makes the duplicate testing code and new value insertion shorter and simpler and faster. It's going to be cheaper than the MD5 solutions suggested so far.

There are many other tricks you could use to cut down the work in searching for repeated positions. Have a look at http://cube20.org/ for ideas.

OTHER TIPS

You can always try a cryptographic hash function. Since your problem is not a question of security (there is no attacker purposely trying to find distinct states which hash to the same value), you can use a broken hash function. I recommend trying MD4, which is quite fast. Your 54-character string is quite appropriate for MD4 input (MD4 can process inputs up to 55 bytes as a single block).

A basic 2.4 GHz PC can hash about 12 millions such strings per second, using a single core, with a simple unrolled C implementation (e.g. one which would look like the MD4Transform() function in the sample code included in RFC 1320). This may be enough for your needs.

1) Don't Use A Hash You have 9*6 = 54 separate faces on a rubik cube. Even wastefully using 1 byte per face this is 432 bits, so hashing won't save you too much space. A better packing of 3 bits per face comes to 162 bits (21 bytes). It sounds to me like you need a compact way to represent the rubik.

OTOH, if you are looking to store a set of many many previously-visited states then I've found that using a bloom filter instead of a true set gets me decent results (but often non-optimal) with much lower space utilization.

2) If you are married to the idea of a hash: Just use MD5, its slightly more compact than the proposed rubik states, rather fast, and has good collision properties - it's not like you have a malicious adversary trying to cause rubik cube hash collisions ;-).

EDIT: Using cryptographic hash functions, such as MD4/MD5, is usually simple once you have a library or function implementing the algorithm (ex: OpenSSL, GNU TLS, and many stand-alone implementations exist). Usually the function is something like void md5(unsigned char *buf, size_t len, unsigned char *digest) where digest points to a pre-allocated 16 byte buffer and buf is the data to be hashed (your rubik cube structure). Here is some untested C code:

#include <openssl/md5.h>
void main()
{
    unsigned char digest[16];
    unsigned char buf[BUFLEN];
    initializeBuffer(buf);
    MD5(buf,BUFLEN,digest);    // This is the openssl function
    printDigest(digest);
}

And be sure to compile/link with -lssl.

8 corner cubes:

You can assign each of these corners to 8 positions which each require 3 bits to determine which corner cube is at which position for a total of 24 bits.
You can further reduce this to just recording 7-of-8 positions as you can easily use a process of elimination to determine what the 8th corner is (for 21 bits).
However, this can be reduced further as the 8 corners can only be arranged in 8! = 40320 permutations and 40320 can be represented in 16 bits.

Each corner cube can be orientated correctly or be rotated 120° clockwise or anti-clockwise to be in three different positions (represented as 0, 1 and 2 respectively).

This requires 2 bits per corner to represent.
However, the sum of the orientations (modulo 3) is always 0; so, if you know 7-of-8 orientations then (assuming you have a solvable cube) you can calculate the orientation of the 8th corner (giving a total of 14 bits).
Or for a further reduction, seven ternary (base 3) digits can represent the orientation of the corners and this can be represented in 12 binary digits (bits).

So the corners cubes can be represented in 28 bits, if you want to decode the permutations, or in 33 bits, if you want to directly record the positions of 7-of-8 corners.

12 edge cubes:

Each can be represented in 4 bits (for a total of 48 bits) which can be reduced to 44 bits by only recording the position of 11-of-12 edges (for a total of 44 bits).
However, the 12! = 479001600 permutations of the edges can be stored in 29 bits.

Each edge can be either be oriented correctly or flipped:

This requires 1 bit to represent.
However, edges are always flipped in pairs so the parity of the flipped edges will always be zero (again, meaning that you only need to record 11-of-12 orientations for the edges) giving a total of 11 bits required.

So edge cubes can be represented in 40 bits, if you want to decode the permutations, or in 55 bits if you want to record all the positions and flips of 11-of-12 edges.

6 centre cubes

You do not need to record any information about the centre cubes - they are fixed relative to the ball at the centre of the Rubik's cube (so assuming you are not worried about the orientation of any logos on the cube) are immobile.

Total:

Using permutations: 68 bits
Using positions: 88 bits

Just to establish the theoretical minimum representation - the state space of a valid Rubik's cube is about 4.3*10^19. Log₂(4.3*10^19) will then determine how many bits you need to represent that full space, the ceiling of which is 66. So in theory, if you could number every valid state, any given state could be uniquely represented in 66 bits.

While you may want to follow others' advice and find a more compact way of representing the cube, consider representing the state in terms of edge, corner, and face pieces. Due to the swapping laws of legal cube moves, you should be able to concatenate a sequence of 12 4-bit edge locations, 8 3-bit corner locations, and 6 3-bit face locations. This should result in a unique representation using 90 bits.

This representation may not be conducive to the way you are creating your tree, but it is unique, easily comparable, and should be possible to find given a state in your existing representation.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow