This answer comes a bit late, but I'd still like to share it.
I used networkx and lxml (which I found allows a much more elegant traversal of the DOM tree). The tree layout, however, requires Graphviz and pygraphviz to be installed; networkx on its own would just scatter the nodes somewhere on the canvas. The code is longer than strictly necessary because I draw the labels myself in order to have them boxed (networkx can draw labels, but in the version I used it doesn't pass the bbox keyword on to matplotlib).

import networkx as nx
from lxml import html
import matplotlib.pyplot as plt
from networkx.drawing.nx_agraph import graphviz_layout  # needs pygraphviz

raw = "...your raw html"


def traverse(parent, graph, labels):
    """Recursively add parent -> child edges and record each node's tag."""
    labels[parent] = parent.tag
    graph.add_node(parent)  # keeps a childless root in the graph
    for node in parent:  # iterating an lxml element yields its children
        if not isinstance(node.tag, str):  # skip comments and similar nodes
            continue
        graph.add_edge(parent, node)
        traverse(node, graph, labels)


G = nx.DiGraph()
labels = {}  # needed to map from node to tag
html_tag = html.document_fromstring(raw)
traverse(html_tag, G, labels)

pos = graphviz_layout(G, prog='dot')  # 'dot' gives the hierarchical layout

label_props = {'size': 16,
               'color': 'black',
               'weight': 'bold',
               'horizontalalignment': 'center',
               'verticalalignment': 'center',
               'clip_on': True}
bbox_props = {'boxstyle': "round, pad=0.2",
              'fc': "grey",
              'ec': "b",
              'lw': 1.5}

nx.draw_networkx_edges(G, pos, arrows=True)
ax = plt.gca()

# draw the labels by hand so that each one gets a box
for node, label in labels.items():
    x, y = pos[node]
    ax.text(x, y, label,
            bbox=bbox_props,
            **label_props)

ax.xaxis.set_visible(False)
ax.yaxis.set_visible(False)
plt.show()

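The tree layout above depends on Graphviz. If installing it is not an option, a rough fallback can be computed in plain Python. The following is only a sketch under my own assumptions, not something networkx provides: `simple_tree_layout`, `x_gap` and `y_gap` are hypothetical names, and the input is assumed to be a proper tree given as (parent, child) pairs.

```python
from collections import defaultdict


def simple_tree_layout(edges, root, x_gap=1.0, y_gap=1.0):
    """Return {node: (x, y)} for a tree given as (parent, child) edges.

    Leaves get consecutive x slots from left to right, every inner node is
    centred above its first and last child, and depth sets y (root at y = 0,
    deeper levels further down).
    """
    children = defaultdict(list)
    for parent, child in edges:
        children[parent].append(child)

    pos = {}
    next_slot = 0  # index of the next free leaf column

    def place(node, depth):
        nonlocal next_slot
        kids = children.get(node, [])
        if not kids:
            x = next_slot * x_gap
            next_slot += 1
        else:
            xs = [place(kid, depth + 1) for kid in kids]
            x = (xs[0] + xs[-1]) / 2
        pos[node] = (x, -depth * y_gap)
        return x

    place(root, 0)
    return pos


# tiny stand-in for a DOM tree, using tag names as nodes
edges = [("html", "head"), ("html", "body"), ("body", "h1"), ("body", "p")]
pos = simple_tree_layout(edges, "html")
```

Since the result is a plain dict of coordinates, something like `pos = simple_tree_layout(G.edges(), html_tag)` should be usable in place of the `graphviz_layout` call, although the spacing will be cruder than what dot produces.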
Changes to the code if you prefer (or have) to use BeautifulSoup:
I'm no expert, I've only just looked at BS4 for the first time, but it works:
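
#from lxml import html
from bs4 import BeautifulSoup
from bs4.element import NavigableString
...
def traverse(parent, graph, labels):
    # id() rather than hash(): bs4 tags with identical markup hash alike,
    # which would merge e.g. two equal <li> elements into a single node
    labels[id(parent)] = parent.name
    for node in parent.children:
        if isinstance(node, NavigableString):  # skip text, comments, doctype
            continue
        graph.add_edge(id(parent), id(node))
        traverse(node, graph, labels)
...
#html_tag = html.document_fromstring(raw)
soup = BeautifulSoup(raw, "html.parser")  # or "lxml" if it is installed
# first real tag in the document (a leading doctype is not a tag)
html_tag = next(c for c in soup.children if not isinstance(c, NavigableString))
...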
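
If neither lxml nor BeautifulSoup is at hand, the same parent/child edges can be collected with the standard library's `html.parser`. This is a minimal sketch rather than a full replacement: `EdgeCollector` and `VOID_TAGS` are names I made up, and implicitly closed elements (an unclosed `<p>` or `<li>`, for instance) are not handled the way a browser would handle them. Each element gets a running integer id, so two identical `<p>` elements remain separate nodes.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class EdgeCollector(HTMLParser):
    """Collect (parent_id, child_id) edges and a node_id -> tag-name map."""

    def __init__(self):
        super().__init__()
        self.edges = []
        self.labels = {}
        self._stack = []  # ids of the currently open elements

    def handle_starttag(self, tag, attrs):
        node_id = len(self.labels)  # running counter keeps ids unique
        self.labels[node_id] = tag
        if self._stack:
            self.edges.append((self._stack[-1], node_id))
        if tag not in VOID_TAGS:
            self._stack.append(node_id)

    def handle_endtag(self, tag):
        # Close the most recently opened element with this tag name,
        # together with anything still open inside it.
        for i in range(len(self._stack) - 1, -1, -1):
            if self.labels[self._stack[i]] == tag:
                del self._stack[i:]
                break


collector = EdgeCollector()
collector.feed("<html><body><p>a</p><p>a</p><br></body></html>")
collector.close()
```

Afterwards `G.add_edges_from(collector.edges)` together with `labels = collector.labels` should slot into the drawing code in the same way as the BeautifulSoup variant.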