I will describe an algorithm to hash an arbitrary directed graph, not taking into account that the graph is acyclic. In fact even counting the acyclic graphs of a given order is a very complicated task and I believe here this will only make the hashing significantly more complicated and thus slower.
A unique representation of the graph can be given by the neighbourhood list. For each vertex create a list with all it's neighbours. Write all the lists one after the other appending the number of neighbours for each list to the front. Also keep the neighbours sorted in ascending order to make the representation unique for each graph. So for example assume you have the graph:
1->2, 1->5
2->1, 2->4
3->4
5->3
What I propose is that you transform this to ({2,2,5}, {2,1,4}, {1,4}, {0}, {1,3})
, here the curly brackets being only to visualize the representation, not part of the python's syntax. So the list is in fact: (2,2,5, 2,1,4, 1,4, 0, 1,3)
.
Now to compute the unique hash, you need to order these representations somehow and assign a unique number to them. I suggest you do something like a lexicographical sort to do that. Lets assume you have two sequences (a1, b1_1, b_1_2,...b_1_a1,a2, b_2_1, b_2_2,...b_2_a2,...an, b_n_1, b_n_2,...b_n_an)
and (c1, d1_1, d_1_2,...d_1_c1,c2, d_2_1, d_2_2,...d_2_c2,...cn, d_n_1, d_n_2,...d_n_cn)
, Here c and a are the number of neighbours for each vertex and b_i_j and d_k_l are the corresponding neighbours. For the ordering first compare the sequnces (a1,a2,...an)
and (c1,c2, ...,cn)
and if they are different use this to compare the sequences. If these sequences are different, compare the lists from left to right first comparing lexicographically (b_1_1, b_1_2...b_1_a1)
to (d_1_1, d_1_2...d_1_c1)
and so on until the first missmatch.
In fact what I propose to use as hash the lexicographical number of a word of size N
over the alphabet that is formed by all possible selections of subsets of elements of {1,2,3,...N}
. The neighbourhood list for a given vertex is a letter over this alphabet e.g. {2,2,5}
is the subset consisting of two elements of the set, namely 2
and 5
.
The alphabet(set of possible letters) for the set {1,2,3}
would be(ordered lexicographically):
{0}, {1,1}, {1,2}, {1,3}, {2, 1, 2}, {2, 1, 3}, {2, 2, 3}, {3, 1, 2, 3}
First number like above is the number of elements in the given subset and the remaining numbers- the subset itself. So form all the 3 letter words from this alphabet and you will get all the possible directed graphs with 3 vertices.
Now the number of subsets of the set {1,2,3,....N}
is 2^N
and thus the number of letters of this alphabet is 2^N
. Now we code each directed graph of N
nodes with a word with exactly N
letters from this alphabet and thus the number of possible hash codes is precisely: (2^N)^N
. This is to show that the hash code grows really fast with the increase of N
. Also this is the number of possible different directed graphs with N
nodes so what I suggest is optimal hashing in the sense it is bijection and no smaller hash can be unique.
There is a linear algorithm to get a given subset number in the the lexicographical ordering of all subsets of a given set, in this case {1,2,....N}
. Here is the code I have written for coding/decoding a subset in number and vice versa. It is written in C++
but quite easy to understand I hope. For the hashing you will need only the code function but as the hash I propose is reversable I add the decode function - you will be able to reconstruct the graph from the hash which is quite cool I think:
typedef long long ll;
// Returns the number in the lexicographical order of all combinations of n numbers
// of the provided combination.
ll code(vector<int> a,int n)
{
sort(a.begin(),a.end()); // not needed if the set you pass is already sorted.
int cur = 0;
int m = a.size();
ll res =0;
for(int i=0;i<a.size();i++)
{
if(a[i] == cur+1)
{
res++;
cur = a[i];
continue;
}
else
{
res++;
int number_of_greater_nums = n - a[i];
for(int j = a[i]-1,increment=1;j>cur;j--,increment++)
res += 1LL << (number_of_greater_nums+increment);
cur = a[i];
}
}
return res;
}
// Takes the lexicographical code of a combination of n numbers and returns the
// combination
vector<int> decode(ll kod, int n)
{
vector<int> res;
int cur = 0;
int left = n; // Out of how many numbers are we left to choose.
while(kod)
{
ll all = 1LL << left;// how many are the total combinations
for(int i=n;i>=0;i--)
{
if(all - (1LL << (n-i+1)) +1 <= kod)
{
res.push_back(i);
left = n-i;
kod -= all - (1LL << (n-i+1)) +1;
break;
}
}
}
return res;
}
Also this code stores the result in long long
variable, which is only enough for graphs with less than 64 elements. All possible hashes of graphs with 64 nodes will be (2^64)^64
. This number has about 1280 digits so maybe is a big number. Still the algorithm I describe will work really fast and I believe you should be able to hash and 'unhash' graphs with a lot of vertices.
Also have a look at this question.