Question

Flat files and relational databases give us a mechanism to serialize structured data. XML is superb for serializing unstructured tree-like data.

But many problems are best represented by graphs. A thermal simulation program will, for instance, work with temperature nodes connected to each other through resistive edges.

So what is the best way to serialize a graph structure? I know XML can, to some extent, do it---in the same way that a relational database can serialize a complex web of objects: it usually works but can easily get ugly.

I know about the dot language used by the graphviz program, but I'm not sure this is the best way to do it. This question is probably the sort of thing academia might be working on and I'd love to have references to any papers discussing this.

Solution

How do you represent your graph in memory?
Basically you have two (good) options: an adjacency list or an adjacency matrix.

The adjacency list representation is best used for a sparse graph, and the matrix representation for a dense graph.

If you use one of these representations, you could serialize that representation directly.

If it has to be human readable you could still opt for creating your own serialization algorithm. For example you could write down the matrix representation like you would do with any "normal" matrix: just print out the columns and rows, and all the data in it like so:

   1  2  3
1 #t #f #f
2 #f #f #t
3 #f #t #f

(this is a non-optimized, non weighted representation, but can be used for directed graphs)
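As a rough illustration, here is a minimal Python sketch that writes and reads the layout shown above. The function names `dump_matrix` and `load_matrix` are made up for this example, and the column alignment only holds while the labels are single digits.

```python
def dump_matrix(matrix):
    """Render a boolean adjacency matrix with 1-based row and column labels."""
    n = len(matrix)
    lines = ["   " + "  ".join(str(j + 1) for j in range(n))]
    for i, row in enumerate(matrix):
        cells = " ".join("#t" if cell else "#f" for cell in row)
        lines.append(f"{i + 1} {cells}")
    return "\n".join(lines)


def load_matrix(text):
    """Parse text produced by dump_matrix back into a list of lists of bools."""
    rows = text.strip().splitlines()[1:]  # skip the column-label header
    # The first token of each row is its label; the rest are the cells.
    return [[cell == "#t" for cell in row.split()[1:]] for row in rows]


# The 3-node directed graph from the listing above.
matrix = [
    [True, False, False],
    [False, False, True],
    [False, True, False],
]
text = dump_matrix(matrix)
print(text)
assert load_matrix(text) == matrix
```

For a weighted graph the same idea should work with numbers in place of #t/#f, at the cost of a slightly more careful parser.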

OTHER TIPS

Typically relationships in XML are shown by the parent/child relationship. XML can handle graph data but not in this manner. To handle graphs in XML you should use the xs:ID and xs:IDREF schema types.

In an example, assume that node/@id is an xs:ID type and that link/@ref is an xs:IDREF type. The following XML shows the cycle of three nodes 1 -> 2 -> 3 -> 1.

<data>
  <node id="1"> 
    <link ref="2"/>
  </node>
  <node id="2">
    <link ref="3"/>
  </node>
  <node id="3">
    <link ref="1"/>
  </node>
</data>

Many development tools have support for ID and IDREF too. I have used Java's JAXB (Java Architecture for XML Binding). It supports these through the @XmlID and the @XmlIDREF annotations. You can build your graph using plain Java objects and then use JAXB to handle the actual serialization to XML.
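The node/link layout is not tied to Java. As a sketch only, here is how the three-node cycle above could be read and written with Python's standard xml.etree.ElementTree. That module does not validate against a schema, so the check that every ref points at an existing id is done by hand; `load_graph` and `dump_graph` are names invented for this example.

```python
import xml.etree.ElementTree as ET

XML = """
<data>
  <node id="1"><link ref="2"/></node>
  <node id="2"><link ref="3"/></node>
  <node id="3"><link ref="1"/></node>
</data>
"""


def load_graph(xml_text):
    """Return {node id: [ids it links to]} and check that every ref resolves."""
    root = ET.fromstring(xml_text)
    graph = {
        node.get("id"): [link.get("ref") for link in node.findall("link")]
        for node in root.findall("node")
    }
    # Manual stand-in for the IDREF check a schema validator would perform.
    for source, targets in graph.items():
        for target in targets:
            if target not in graph:
                raise ValueError(f"node {source} links to unknown id {target}")
    return graph


def dump_graph(graph):
    """Write the adjacency dict back out in the same node/link layout."""
    root = ET.Element("data")
    for source, targets in graph.items():
        node = ET.SubElement(root, "node", id=source)
        for target in targets:
            ET.SubElement(node, "link", ref=target)
    return ET.tostring(root, encoding="unicode")


graph = load_graph(XML)
assert load_graph(dump_graph(graph)) == graph
```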

XML is very verbose. Whenever I do it, I roll my own. Here's an example of a 3 node directed acyclic graph. It's pretty compact and does everything I need it to do:

0: foo
1: bar
2: bat
----
0 1
0 2
1 2
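A reader and writer for a format like this fits in a few lines. The sketch below is in Python and is only my reading of the layout: node lines are "index: label", a line of dashes separates the sections, and each edge line is "source target". The names `dump` and `load` are hypothetical, and a label containing "----" would break this simple parser.

```python
def dump(labels, edges):
    """labels: list of node names (list index = node id); edges: (src, dst) pairs."""
    lines = [f"{i}: {name}" for i, name in enumerate(labels)]
    lines.append("----")
    lines.extend(f"{src} {dst}" for src, dst in edges)
    return "\n".join(lines)


def load(text):
    """Inverse of dump; returns (labels, edges)."""
    head, _, tail = text.partition("----")
    labels = []
    for line in head.strip().splitlines():
        index, name = line.split(": ", 1)
        if int(index) != len(labels):
            raise ValueError("node ids must be 0, 1, 2, ... in order")
        labels.append(name)
    edges = [tuple(map(int, line.split())) for line in tail.strip().splitlines()]
    return labels, edges


labels = ["foo", "bar", "bat"]
edges = [(0, 1), (0, 2), (1, 2)]
text = dump(labels, edges)
print(text)
assert load(text) == (labels, edges)
```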

One example you might be familiar with is Java serialization. This effectively serializes a graph, with each object instance being a node, and each reference being an edge. The algorithm used is recursive, but skips objects it has already written. So the pseudo code would be:

done = empty set of already-serialized objects

serialize(x):
    if x is in done then return
    record x as serialized in done
    record properties of x
    for each neighbour/child c of x: serialize(c)
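To make the idea concrete, here is a small Python sketch of the same traversal. It is not Java's actual wire format, just an illustration: the `Node` class, the integer handles, and the record layout are all assumptions of this example. Each object gets a handle the first time it is seen, and references are written as handles, so cycles terminate.

```python
class Node:
    """Hypothetical graph node: a name plus outgoing references."""

    def __init__(self, name):
        self.name = name
        self.neighbours = []


def serialize(node, done=None, out=None):
    """Depth-first walk that records each object once, even with cycles."""
    if done is None:
        done, out = {}, []  # shared across the whole recursive walk
    if id(node) in done:
        return out  # already written; callers refer to it by handle
    handle = len(done)
    done[id(node)] = handle  # mark before recursing so cycles stop here
    record = {"handle": handle, "name": node.name, "refs": []}
    out.append(record)
    for child in node.neighbours:
        serialize(child, done, out)
        record["refs"].append(done[id(child)])
    return out


# A three-node cycle: a -> b -> c -> a.
a, b, c = Node("a"), Node("b"), Node("c")
a.neighbours.append(b)
b.neighbours.append(c)
c.neighbours.append(a)
records = serialize(a)
for record in records:
    print(record)
```

Because this is recursive, a very deep graph can hit Python's recursion limit; an explicit stack avoids that.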

Another way of course is as a list of nodes and edges, which can be done as XML, or in any other preferred serialization format, or as an adjacency matrix.
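For instance, a node list plus an edge list maps naturally onto JSON. The sketch below uses Python's json module; the field names ("nodes", "edges", "source", "target") and the thermal attributes are just assumptions for illustration, loosely following the temperature/resistance example from the question.

```python
import json


def dump_graph(nodes, edges):
    """nodes: {id: attributes}; edges: iterable of (source, target, attributes)."""
    doc = {
        "nodes": [{"id": nid, **attrs} for nid, attrs in nodes.items()],
        "edges": [{"source": s, "target": t, **attrs} for s, t, attrs in edges],
    }
    return json.dumps(doc, indent=2)


def load_adjacency(text):
    """Rebuild an adjacency list {id: [(neighbour, edge attributes), ...]}."""
    doc = json.loads(text)
    adjacency = {node["id"]: [] for node in doc["nodes"]}
    for edge in doc["edges"]:
        attrs = {k: v for k, v in edge.items() if k not in ("source", "target")}
        adjacency[edge["source"]].append((edge["target"], attrs))
    return adjacency


nodes = {"n1": {"temperature": 300.0}, "n2": {"temperature": 350.0}}
edges = [("n1", "n2", {"resistance": 0.5})]
adjacency = load_adjacency(dump_graph(nodes, edges))
print(adjacency)
```

Keeping nodes and edges in separate lists means a node is written once no matter how many edges touch it, which sidesteps the nesting problem of tree-shaped formats.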

Adjacency lists and adjacency matrices are the two common ways of representing graphs in memory. The first decision you need to make when deciding between these two is what you want to optimize for. Adjacency lists are very fast if you need to, for example, get the list of a vertex's neighbors. On the other hand, if you are doing a lot of testing for edge existence or have a graph representation of a Markov chain, then you'd probably favor an adjacency matrix.

The next question you need to consider is how much you need to fit into memory. In most cases, where the number of edges in the graph is much smaller than the total number of possible edges, an adjacency list is going to be more efficient, since you only need to store the edges that actually exist. A happy medium is to represent the adjacency matrix in compressed sparse row format, in which you keep a vector of the non-zero entries from top left to bottom right, a corresponding vector indicating which columns the non-zero entries can be found in, and a third vector indicating the start of each row in the column-entry vector.

[[0.0, 0.0, 0.3, 0.1]
 [0.1, 0.0, 0.0, 0.0]
 [0.0, 0.0, 0.0, 0.0]
 [0.5, 0.2, 0.0, 0.3]]

can be represented as:

vals: [0.3, 0.1, 0.1, 0.5, 0.2, 0.3]
cols: [2,   3,   0,   0,   1,   3]
rows: [0,   2,   null, 3]

Compressed sparse row is effectively an adjacency list (the column indices function the same way), but the format lends itself a bit more cleanly to matrix operations.
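Here is a minimal Python sketch of the conversion, assuming plain lists rather than any particular matrix library. One difference from the listing above: instead of marking an empty row with null, it uses the common convention of a row-pointer vector with one extra trailing entry, so row i occupies the half-open range row_ptr[i] to row_ptr[i + 1] and an empty row is simply an empty range. The names `to_csr` and `neighbours` are made up for this example.

```python
def to_csr(dense):
    """Convert a dense row-major matrix to (vals, cols, row_ptr)."""
    vals, cols, row_ptr = [], [], [0]
    for row in dense:
        for j, x in enumerate(row):
            if x != 0:
                vals.append(x)
                cols.append(j)
        row_ptr.append(len(vals))  # where the next row starts
    return vals, cols, row_ptr


def neighbours(csr, i):
    """Outgoing edges of vertex i as (column, weight) pairs."""
    vals, cols, row_ptr = csr
    start, end = row_ptr[i], row_ptr[i + 1]
    return list(zip(cols[start:end], vals[start:end]))


dense = [
    [0.0, 0.0, 0.3, 0.1],
    [0.1, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.5, 0.2, 0.0, 0.3],
]
csr = to_csr(dense)
print(csr)
# ([0.3, 0.1, 0.1, 0.5, 0.2, 0.3], [2, 3, 0, 0, 1, 3], [0, 2, 3, 3, 6])
print(neighbours(csr, 3))
# [(0, 0.5), (1, 0.2), (3, 0.3)]
```

The three vectors are flat lists of numbers, so they serialize trivially to almost any format.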

On a less academic, more practical note, in CubicTest we use XStream (Java) to serialize tests to and from XML. XStream handles graph-structured object relations, so you might learn a thing or two from looking at its source and the resulting XML. You're right about the ugly part though: the generated XML files don't look pretty.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow