Hash value for directed acyclic graph

Question 1

To effectively test for graph isomorphism you will want to use nauty. Specifically for Python there is the wrapper pynauty, but I can't attest its quality (to compile it correctly I had to do some simple patching on its setup.py). If this wrapper is doing everything correctly, then it simplifies nauty a lot for the uses you are interested and it is only a matter of hashing pynauty.certificate(somegraph) -- which will be the same value for isomorphic graphs.

Some quick tests showed that pynauty is giving the same certificate for every graph (with same amount of vertices). But that is only because of a minor issue in the wrapper when converting the graph to nauty's format. After fixing this, it works for me (I also used the graphs at http://funkybee.narod.ru/graphs.htm for comparison). Here is the short patch which also considers the modifications needed in setup.py:

diff -ur pynauty-0.5-orig/setup.py pynauty-0.5/setup.py
--- pynauty-0.5-orig/setup.py   2011-06-18 20:53:17.000000000 -0300
+++ pynauty-0.5/setup.py        2013-01-28 22:09:07.000000000 -0200
@@ -31,7 +31,9 @@

 ext_pynauty = Extension(
         name = MODULE + '._pynauty',
-        sources = [ pynauty_dir + '/' + 'pynauty.c', ],
+        sources = [ pynauty_dir + '/' + 'pynauty.c',
+            os.path.join(nauty_dir, 'schreier.c'),
+            os.path.join(nauty_dir, 'naurng.c')],
         depends = [ pynauty_dir + '/' + 'pynauty.h', ],
         extra_compile_args = [ '-O4' ],
         extra_objects = [ nauty_dir + '/' + 'nauty.o',
diff -ur pynauty-0.5-orig/src/pynauty.c pynauty-0.5/src/pynauty.c
--- pynauty-0.5-orig/src/pynauty.c      2011-03-03 23:34:15.000000000 -0300
+++ pynauty-0.5/src/pynauty.c   2013-01-29 00:38:36.000000000 -0200
@@ -320,7 +320,7 @@
     PyObject *adjlist;
     PyObject *p;

-    int i,j;
+    Py_ssize_t i, j;
     int adjlist_length;
     int x, y;

Question 2

Graph isomorphism for directed acyclic graphs is still GI-complete. Therefore there is currently no known (worst case sub-exponential) solution to guarantee that two isomorphic directed acyclic graphs will yield the same hash. Only if the mapping between different graphs is known - for example if all vertices have unique labels - one could efficiently guarantee matching hashes.

Okay, let's brute force this for a small number of vertices. We have to find a representation of the graph that is independent of the ordering of the vertices in the input and therefore guarantees that isomorphic graphs yield the same representation. Further this representation must ensure that no two non-isomorphic graphs yield the same representation.

The simplest solution is to construct the adjacency matrix for all n! permutations of the vertices and just interpret the adjacency matrix as n² bit integer. Then we can just pick the smallest or largest of this numbers as canonical representation. This number completely encodes the graph and therefore ensures that no two non-isomorphic graphs yield the same number - one could consider this function a perfect hash function. And because we choose the smallest or largest number encoding the graph under all possible permutations of the vertices we further ensure that isomorphic graphs yield the same representation.

How good or bad is this in the case of 11 vertices? Well, the representation will have 121 bits. We can reduce this by 11 bits because the diagonal representing loops will be all zeros in an acyclic graph and are left with 110 bits. This number could in theory be decreased further; not all 2¹¹⁰ remaining graphs are acyclic and for each graph there may be up to 11! - roughly 2²⁵ - isomorphic representations but in practice this might be quite hard to do. Does anybody know how to compute the number of distinct directed acyclic graphs with n vertices?

How long will it take to find this representation? Naively 11! or 39,916,800 iterations. This is not nothing and probably already impractical but I did not implement and test it. But we can probably speed this up a bit. If we interpret the adjacency matrix as integer by concatenating the rows from top to bottom left to right we want many ones (zeros) at the left of the first row to obtain a large (small) number. Therefore we pick as first vertex the one (or one of the vertices) with largest (smallest) degree (indegree or outdegree depending on the representation) and than vertices connected (not connected) to this vertex in subsequent positions to bring the ones (zeros) to the left.

There are likely more possibilities to prune the search space but I am not sure if there are enough to make this a practical solution. Maybe there are or maybe somebody else can at least build something upon this idea.

Question 3

How good does the hash have to be? I assume that you do not want a full serialization of the graph. A hash rarely guarantees that there is no second (but different) element (graph) that evaluates to the same hash. If it is very important to you, that isomorphic graphs (in different representations) have the same hash, then only use values that are invariant under a change of representation. E.g.:

the total number of nodes
the total number of (directed) connections
the total number of nodes with (indegree, outdegree) = (i,j) for any tuple (i,j) up to (max(indegree), max(outdegree)) (or limited for tuples up to some fixed value (m,n))

All these informations can be gathered in O(#nodes) [assuming that the graph is stored properly]. Concatenate them and you have a hash. If you prefer you can use some well known hash algorithm like sha on these concatenated informations. Without additional hashing it is a continuous hash (it allows to find similar graphs), with additional hashing it is uniform and fixed in size if the chosen hash algorithm has these properties.

As it is, it is already good enough to register any added or removed connection. It might miss connections that were changed though (a -> c instead of a -> b).

This approach is modular and can be extended as far as you like. Any additional property that is being included will reduce the number of collisions but increase the effort necessary to get the hash value. Some more ideas:

same as above but with second order in- and outdegree. Ie. the number of nodes that can be reached by a node->child->child chain ( = second order outdegree) or respectively the number of nodes that lead to the given node in two steps.
or more general n-th order in- and outdegree (can be computed in O((average-number-of-connections) ^ (n-1) * #nodes) )
number of nodes with eccentricity = x (again for any x)
if the nodes store any information (other than their neighbours) use a xor of any kind of hash of all the node-contents. Due to the xor the specific order in which the nodes where added to the hash does not matter.

You requested "a unique hash value" and clearly I cannot offer you one. But I see the terms "hash" and "unique to every graph" as mutually exclusive (not entirely true of course) and decided to answer the "hash" part and not the "unique" part. A "unique hash" (perfect hash) basically needs to be a full serialization of the graph (because the amount of information stored in the hash has to reflect the total amount of information in the graph). If that is really what you want just define some unique order of nodes (eg. sorted by own outdegree, then indegree, then outdegree of children and so on until the order is unambiguous) and serialize the graph in any way (using the position in the formentioned ordering as index to the nodes).

Of course this is much more complex though.

Question 4

Years ago, I created a simple and flexible algorithm for exactly this problem (finding duplicate structures in a database of chemical structures by hashing them).

I named it "Powerhash", and to create the algorithm it required two insights. The first is the power iteration graph algorithm, also used in PageRank. The second is the ability to replace power iteration's inside step function with anything that we want. I replaced it with a function that does the following on each step, and for each node:

Sort the hashes of the node's neighbors
Hash the concatenated sorted hashes

On the first step, a node's hash is affected by its direct neighbors. On the second step, a node's hash is affected by the neighborhood 2-hops away from it. On the Nth step a node's hash will be affected by the neighborhood N-hops around it. So you only need to continue running the Powerhash for N = graph_radius steps. In the end, the graph center node's hash will have been affected by the whole graph.

To produce the final hash, sort the final step's node hashes and concatenate them together. After that, you can compare the final hashes to find if two graphs are isomorphic. If you have labels, then add them in the internal hashes that you calculate for each node (and at each step).

For more on this you can look at my post here:

https://plus.google.com/114866592715069940152/posts/fmBFhjhQcZF

The algorithm above was implemented inside the "madIS" functional relational database. You can find the source code of the algorithm here:

https://github.com/madgik/madis/blob/master/src/functions/aggregate/graph.py

Question 5

Imho, If the graph could be topologically sorted, the very straightforward solution exists.

For each vertex with index i, you could build an unique hash (for example, using the hashing technique for strings) of his (sorted) direct neighbours (p.e. if vertex 1 has direct neighbours {43, 23, 2,7,12,19,334} the hash functions should hash the array of {2,7,12,19,23,43,334})
For the whole DAG you could create a hash, as a hash of a string of hashes for each node: Hash(DAG) = Hash(vertex_1) U Hash(vertex_2) U ..... Hash(vertex_N); I think the complexity of this procedure is around (N*N) in the worst case. If the graph could not be topologically sorted, the approach proposed is still applicable, but you need to order vertices in an unique way (and this is the hard part)

Question 6

I will describe an algorithm to hash an arbitrary directed graph, not taking into account that the graph is acyclic. In fact even counting the acyclic graphs of a given order is a very complicated task and I believe here this will only make the hashing significantly more complicated and thus slower.

A unique representation of the graph can be given by the neighbourhood list. For each vertex create a list with all it's neighbours. Write all the lists one after the other appending the number of neighbours for each list to the front. Also keep the neighbours sorted in ascending order to make the representation unique for each graph. So for example assume you have the graph:

1->2, 1->5
2->1, 2->4
3->4
5->3

What I propose is that you transform this to ({2,2,5}, {2,1,4}, {1,4}, {0}, {1,3}), here the curly brackets being only to visualize the representation, not part of the python's syntax. So the list is in fact: (2,2,5, 2,1,4, 1,4, 0, 1,3).

Now to compute the unique hash, you need to order these representations somehow and assign a unique number to them. I suggest you do something like a lexicographical sort to do that. Lets assume you have two sequences (a1, b1_1, b_1_2,...b_1_a1,a2, b_2_1, b_2_2,...b_2_a2,...an, b_n_1, b_n_2,...b_n_an) and (c1, d1_1, d_1_2,...d_1_c1,c2, d_2_1, d_2_2,...d_2_c2,...cn, d_n_1, d_n_2,...d_n_cn), Here c and a are the number of neighbours for each vertex and b_i_j and d_k_l are the corresponding neighbours. For the ordering first compare the sequnces (a1,a2,...an) and (c1,c2, ...,cn) and if they are different use this to compare the sequences. If these sequences are different, compare the lists from left to right first comparing lexicographically (b_1_1, b_1_2...b_1_a1) to (d_1_1, d_1_2...d_1_c1) and so on until the first missmatch.

In fact what I propose to use as hash the lexicographical number of a word of size N over the alphabet that is formed by all possible selections of subsets of elements of {1,2,3,...N}. The neighbourhood list for a given vertex is a letter over this alphabet e.g. {2,2,5} is the subset consisting of two elements of the set, namely 2 and 5.

The alphabet(set of possible letters) for the set {1,2,3} would be(ordered lexicographically):

{0}, {1,1}, {1,2}, {1,3}, {2, 1, 2}, {2, 1, 3}, {2, 2, 3}, {3, 1, 2, 3}

First number like above is the number of elements in the given subset and the remaining numbers- the subset itself. So form all the 3 letter words from this alphabet and you will get all the possible directed graphs with 3 vertices.

Now the number of subsets of the set {1,2,3,....N} is 2^N and thus the number of letters of this alphabet is 2^N. Now we code each directed graph of N nodes with a word with exactly N letters from this alphabet and thus the number of possible hash codes is precisely: (2^N)^N. This is to show that the hash code grows really fast with the increase of N. Also this is the number of possible different directed graphs with N nodes so what I suggest is optimal hashing in the sense it is bijection and no smaller hash can be unique.

There is a linear algorithm to get a given subset number in the the lexicographical ordering of all subsets of a given set, in this case {1,2,....N}. Here is the code I have written for coding/decoding a subset in number and vice versa. It is written in C++ but quite easy to understand I hope. For the hashing you will need only the code function but as the hash I propose is reversable I add the decode function - you will be able to reconstruct the graph from the hash which is quite cool I think:

typedef long long ll;

// Returns the number in the lexicographical order of all combinations of n numbers
// of the provided combination. 
ll code(vector<int> a,int n)
{
    sort(a.begin(),a.end());  // not needed if the set you pass is already sorted.
    int cur = 0;
    int m = a.size();

    ll res =0;
    for(int i=0;i<a.size();i++)
    {
        if(a[i] == cur+1)
        {
            res++;
            cur = a[i];
            continue;
        }
        else
        {
            res++;
            int number_of_greater_nums = n - a[i];
            for(int j = a[i]-1,increment=1;j>cur;j--,increment++)
                res += 1LL << (number_of_greater_nums+increment);
            cur = a[i];
        }
    }
    return res;
}
// Takes the lexicographical code of a combination of n numbers and returns the 
// combination
vector<int> decode(ll kod, int n)
{
    vector<int> res;
    int cur = 0;

    int left = n; // Out of how many numbers are we left to choose.
    while(kod)
    {
        ll all = 1LL << left;// how many are the total combinations
        for(int i=n;i>=0;i--)
        {
            if(all - (1LL << (n-i+1)) +1 <= kod)
            {
                res.push_back(i);
                left = n-i;
                kod -= all - (1LL << (n-i+1)) +1;
                break;
            }
        }
    }
    return res;
}

Also this code stores the result in long long variable, which is only enough for graphs with less than 64 elements. All possible hashes of graphs with 64 nodes will be (2^64)^64. This number has about 1280 digits so maybe is a big number. Still the algorithm I describe will work really fast and I believe you should be able to hash and 'unhash' graphs with a lot of vertices.

Also have a look at this question.

Question 7

I'm not sure that it's 100% working, but here is an idea:

Let's code a graph into a string and then take its hash.

hash of an empty graph is ""
hash of a vertex with no outgoing edges is "."
hash of a vertex with outgoing edges is concatenation of every child hash with some delimiter (e.g. ",")

To produce the same hash for isomorphic graphs before concatenation in step3 just sort the hashes (e.g. in lexicographical order).

For hash of a graph just take hash of its root (or sorted concatenation, if there are several roots).

edit While I hoped that the resulting string will describe graph without collisions, hynekcer found that sometimes non-isomorphic graphs will get the same hash. That happens when a vertex has several parents - then it "duplicated" for every parent. For example, the algorithm does not differentiate a "diamond" {A->B->C,A->D->C} from the case {A->B->C,A->D->E}.

I'm not familiar with Python and it's hard for me to understand how graph stored in the example, but here is some code in C++ which is likely convertible to Python easily:

THash GetHash(const TGraph &graph)
{
    return ComputeHash(GetVertexStringCode(graph,FindRoot(graph)));
}
std::string GetVertexStringCode(const TGraph &graph,TVertexIndex vertex)
{
    std::vector<std::string> childHashes;
    for(auto c:graph.GetChildren(vertex))
        childHashes.push_back(GetVertexStringCode(graph,*c));
    std::sort(childHashes.begin(),childHashes.end());
    std::string result=".";
    for(auto h:childHashes)
        result+=*h+",";
    return result;
}

Question 8

I am assuming there are no common labels on vertices or edges, for then you could put the graph in a canonical form, which itself would be a perfect hash. This proposal is therefore based on isomorphism only.

For this, combine hashes for as many simple aggregate characteristics of a DAG as you can imagine, picking those that are quick to compute. Here is a starter list:

2d histogram of nodes' in and out degrees.
4d histogram of edges a->b where a and b are both characterized by in/out degree.

Addition Let me be more explicit. For 1, we'd compute a set of triples <I,O;N> (where no two triples have the same I,O values), signifying that there are N nodes with in-degree I and out-degree O. You'd hash this set of triples or better yet use the whole set arranged in some canonical order e.g. lexicographically sorted. For 2, we compute a set of quintuples <aI,aO,bI,bO;N> signifying that there are N edges from nodes with in degree aI and out degree aO, to nodes with bI and bO respectively. Again hash these quintuples or else use them in canonical order as-is for another part of the final hash.

Starting with this and then looking at collisions that still occur will probably provide insights on how to get better.

Question 9

When I saw the question, I had essentially the same idea as @example. I wrote a function providing a graph tag such that the tag coincides for two isomorphic graphs.

This tag consists of the sequence of out-degrees in ascending order. You can hash this tag with the string hash function of your choice to obtain a hash of the graph.

Edit: I expressed my proposal in the context of @NeilG's original question. The only modification to make to his code is to redefine the hashkey function as:

def hashkey(self): 
    return tuple(sorted(map(len,self.lt.values())))

Question 10

With suitable ordering of your descendents (and if you have a single root node, not a given, but with suitable ordering (maybe by including a virtual root node)), the method for hashing a tree ought to work with a slight modification.

Example code in this StackOverflow answer, the modification would be to sort children in some deterministic order (increasing hash?) before hashing the parent.

Even if you have multiple possible roots, you can create a synthetic single root, with all roots as children.