How to iterate over GraphML file with lxml
Question
I have the following GraphML file 'mygraph.gml' that I want to parse with a simple python script:
This represents a simple graph with 2 nodes "node0", "node1" and an edge between them
<?xml version="1.0" encoding="UTF-8"?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns
http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd">
<key id="name" for="node" attr.name="name" attr.type="string"/>
<key id="weight" for="edge" attr.name="weight" attr.type="double"/>
<graph id="G" edgedefault="directed">
<node id="n0">
<data key="name">node1</data>
</node>
<node id="n1">
<data key="name">node2</data>
</node>
<edge source="n1" target="n0">
<data key="weight">1</data>
</edge>
</graph>
</graphml>
This represents a graph with two nodes n0 and n1 with an edge of weight 1 between them. I want to parse this structure with python.
I wrote a script with the help of lxml (I need to use it because the dataset in much much bigger than this simple example, more than 10^5 nodes, python minidom is too slow)
import lxml.etree as et
tree = et.parse('mygraph.gml')
root = tree.getroot()
graphml = {
"graph": "{http://graphml.graphdrawing.org/xmlns}graph",
"node": "{http://graphml.graphdrawing.org/xmlns}node",
"edge": "{http://graphml.graphdrawing.org/xmlns}edge",
"data": "{http://graphml.graphdrawing.org/xmlns}data",
"label": "{http://graphml.graphdrawing.org/xmlns}data[@key='label']",
"x": "{http://graphml.graphdrawing.org/xmlns}data[@key='x']",
"y": "{http://graphml.graphdrawing.org/xmlns}data[@key='y']",
"size": "{http://graphml.graphdrawing.org/xmlns}data[@key='size']",
"r": "{http://graphml.graphdrawing.org/xmlns}data[@key='r']",
"g": "{http://graphml.graphdrawing.org/xmlns}data[@key='g']",
"b": "{http://graphml.graphdrawing.org/xmlns}data[@key='b']",
"weight": "{http://graphml.graphdrawing.org/xmlns}data[@key='weight']",
"edgeid": "{http://graphml.graphdrawing.org/xmlns}data[@key='edgeid']"
}
graph = tree.find(graphml.get("graph"))
nodes = graph.findall(graphml.get("node"))
edges = graph.findall(graphml.get("edge"))
This script gets correctly the nodes and edges so that I can simply iterate over them
for n in nodes:
print n.attrib
or similarly on edges:
for e in edges:
print (e.attrib['source'], e.attrib['target'])
but I can't really understand how to get the "data" tag for the edges or the nodes in order to print the edge weight and nodes tag "name".
This doesn't work for me:
weights = graph.findall(graphml.get("weight"))
the last list is always empty. Why? I'm missing something around but can't understand what.
Solution
You can't do it in one pass, but for each node found, you can build a dict with the key/value of data:
graph = tree.find(graphml.get("graph"))
nodes = graph.findall(graphml.get("node"))
edges = graph.findall(graphml.get("edge"))
for node in nodes + edges:
attribs = {}
for data in node.findall(graphml.get('data')):
attribs[data.get('key')] = data.text
print 'Node', node, 'have', attribs
It give the result:
Node <Element {http://graphml.graphdrawing.org/xmlns}node at 0x7ff053d3e5a0> have {'name': 'node1'}
Node <Element {http://graphml.graphdrawing.org/xmlns}node at 0x7ff053d3e5f0> have {'name': 'node2'}
Node <Element {http://graphml.graphdrawing.org/xmlns}edge at 0x7ff053d3e640> have {'weight': '1'}