Representing a non-binary tree as arcs would be difficult, but "entity" annotations can be nested, and that nesting can encode a constituency parse. Note that I'm not creating nodes for the terminals (part-of-speech tags) of the tree, partly because Brat is not currently good at displaying the unary rules that often apply to terminals. The target format is described in Brat's standoff format documentation.
First, we need a function to produce standoff annotations. Brat expects standoff offsets in characters, but the following function works in token offsets; these are converted to character offsets further down.
(Note this uses NLTK 3.0b and Python 3)
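def _standoff(path, leaves, slices, offset, tree):
    # Returns the number of leaves under `tree`; `offset` is the index of its first leaf.
    width = 0
    for i, child in enumerate(tree):
        if isinstance(child, tuple):
            # A (token, tag) leaf: record the token only
            tok, tag = child
            leaves.append(tok)
            width += 1
        else:
            # A subtree: recurse, tracking the child-index path from the root
            path.append(i)
            width += _standoff(path, leaves, slices, offset + width, child)
            path.pop()
    # Record this node after its children, so slices come out in post-order
    slices.append((tuple(path), tree.label(), offset, offset + width))
    return width


def standoff(tree):
    leaves = []
    slices = []
    _standoff([], leaves, slices, 0, tree)
    return leaves, slices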
Applying this to your example:
>>> from nltk.tree import Tree
>>> tree = Tree('S', [Tree('Clause', [Tree('Process-dependencies', [Tree('Participant', [('This', 'DT')]), Tree('Verbal-group', [('is', 'VBZ')]), Tree('Participant', [('a', 'DT'), ('representation', 'NN')]), Tree('Circumstance', [('of', 'IN'), ('the', 'DT'), ('grammar', 'NN')])])]), ('.', '.')])
>>> standoff(tree)
(['This', 'is', 'a', 'representation', 'of', 'the', 'grammar', '.'],
[((0, 0, 0), 'Participant', 0, 1),
((0, 0, 1), 'Verbal-group', 1, 2),
((0, 0, 2), 'Participant', 2, 4),
((0, 0, 3), 'Circumstance', 4, 7),
((0, 0), 'Process-dependencies', 0, 7),
((0,), 'Clause', 0, 7),
((), 'S', 0, 8)])
This returns the leaf tokens, followed by a list of tuples, one per subtree, with elements: (path of child indices from the root, label, start leaf, stop leaf). The stop index is exclusive.
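The path in each tuple can be followed back down to the subtree it describes. A minimal sketch, assuming a hypothetical helper `subtree_at` (not part of NLTK) and using nested lists as a stand-in for the tree, which should behave the same way since an NLTK `Tree` is a list subclass:

```python
def subtree_at(tree, path):
    """Follow a tuple of child indices down from the root."""
    node = tree
    for i in path:
        node = node[i]
    return node


# Nested lists stand in for the example tree (labels omitted).
toy = [[[[('This', 'DT')],
         [('is', 'VBZ')],
         [('a', 'DT'), ('representation', 'NN')],
         [('of', 'IN'), ('the', 'DT'), ('grammar', 'NN')]]],
       ('.', '.')]

print(subtree_at(toy, (0, 0, 2)))  # [('a', 'DT'), ('representation', 'NN')]
```

The empty path `()` refers to the root itself.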
To convert this into character standoff:
def char_standoff(tree):
    leaves, tok_standoff = standoff(tree)
    text = ' '.join(leaves)
    # Map leaf index to its start character; the final entry is a sentinel one past the end
    starts = []
    offset = 0
    for leaf in leaves:
        starts.append(offset)
        offset += len(leaf) + 1  # +1 for the joining space
    starts.append(offset)
    # Subtract 1 from the end to drop the space after the last token of the span
    return text, [(path, label, starts[start_tok], starts[end_tok] - 1)
                  for path, label, start_tok, end_tok in tok_standoff]
Then:
>>> char_standoff(tree)
('This is a representation of the grammar .',
[((0, 0, 0), 'Participant', 0, 4),
((0, 0, 1), 'Verbal-group', 5, 7),
((0, 0, 2), 'Participant', 8, 24),
((0, 0, 3), 'Circumstance', 25, 39),
((0, 0), 'Process-dependencies', 0, 39),
((0,), 'Clause', 0, 39),
((), 'S', 0, 41)])
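The offset arithmetic can be checked in isolation. The sketch below assumes, as above, that tokens are joined by single spaces; `token_starts` is a hypothetical helper name, and the leaves and token span are copied from the example:

```python
def token_starts(leaves):
    """Start character of each leaf, plus one sentinel past the end."""
    starts = []
    offset = 0
    for leaf in leaves:
        starts.append(offset)
        offset += len(leaf) + 1  # +1 for the joining space
    starts.append(offset)
    return starts


leaves = ['This', 'is', 'a', 'representation', 'of', 'the', 'grammar', '.']
text = ' '.join(leaves)
starts = token_starts(leaves)

# Token span (2, 4) is the second 'Participant' in the example.
start_tok, end_tok = 2, 4
start, stop = starts[start_tok], starts[end_tok] - 1
print(start, stop, repr(text[start:stop]))  # 8 24 'a representation'
```

Slicing the text with the character offsets gives back exactly the tokens covered by the span, which is what Brat needs.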
Finally, we can write a function that converts this to Brat's format:
def write_brat(tree, filename_prefix):
    text, standoff = char_standoff(tree)
    # The raw text goes in <prefix>.txt
    with open(filename_prefix + '.txt', 'w') as f:
        print(text, file=f)
    # One tab-separated annotation per line goes in <prefix>.ann
    with open(filename_prefix + '.ann', 'w') as f:
        for i, (path, label, start, stop) in enumerate(standoff):
            print('T{}'.format(i),
                  '{} {} {}'.format(label, start, stop),
                  text[start:stop],
                  sep='\t', file=f)
This writes the following to /path/to/something.txt:
This is a representation of the grammar .
and this to /path/to/something.ann (the ID, the label with its offsets, and the covered text are separated by tabs):
T0 Participant 0 4 This
T1 Verbal-group 5 7 is
T2 Participant 8 24 a representation
T3 Circumstance 25 39 of the grammar
T4 Process-dependencies 0 39 This is a representation of the grammar
T5 Clause 0 39 This is a representation of the grammar
T6 S 0 41 This is a representation of the grammar .
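As a sanity check, the offsets of each annotation should reproduce its text when used to slice the document. A minimal sketch, assuming the three tab-separated fields that `write_brat` emits; `check_ann` is a hypothetical helper, and the strings stand in for the contents of the two files:

```python
def check_ann(text, ann):
    """Return the IDs of annotations whose span does not match the text."""
    bad = []
    for line in ann.splitlines():
        ann_id, span, covered = line.split('\t')
        label, start, stop = span.split(' ')
        if text[int(start):int(stop)] != covered:
            bad.append(ann_id)
    return bad


text = 'This is a representation of the grammar .'
ann = ('T0\tParticipant 0 4\tThis\n'
       'T3\tCircumstance 25 39\tof the grammar\n'
       'T6\tS 0 41\tThis is a representation of the grammar .')

print(check_ann(text, ann))  # []
```

An empty list means every span lines up with the text.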