Question

I would like to add the output of a part-of-speech tagger to an existing xml file, with the POS tags as attribute-value pairs on the existing word elements:

house/N + <w>house</w> --> <w pos="N">house</w>

I thought I could give unique IDs to the words, match those and then add the POS-tag to the existing xml file, so I designed the following function in Python:

import xml.etree.ElementTree as ET

def add_postags(POSfile, xmlfile):
    """
    Function that takes two arguments (POSfile, xmlfile).
    If the value of the word <w>'s attribute 'id' in the POSfile matches
    the value of 'id' in the existing xml file,
    it adds the pos tags that are stored as attribute-value pairs in (POSfile)
    to the xml file and writes this to a new document 'xmlPOS'.
    """

    treePOS = ET.parse(POSfile)
    rootPOS = treePOS.getroot()
    tree = ET.parse(xmlfile)
    root = tree.getroot()


    for w in rootPOS.iter('w'):
        idPOS = w.get('id')

    for w in root.iter('w'):
        idxml = w.get('id')

    for w in rootPOS.iter('w'):
        POSval = w.get('pos')

    if idPOS == idxml:        
        w.set('pos', POSval)

    tree.write('xmlPOS.xml')

    return xmlPOS

For this to work I'd have to convert the tagger output 'house/N' to an xml format:

<w id="1" pos="N">house</w>

But even if I do so and then import the above module in Python, I seem to be unable to add the POS tags to the existing xml file (which contains more editorial markup of course than the above example). Perhaps I should use XSLT instead of this Python xml parser? I'm not very familiar with XSLTs yet, so I thought I'd try this in Python first.

Any comments or suggestions will be much appreciated: thanks in advance!


Solution

The set method is the appropriate way to set attributes in ElementTree, and I just tested that it works when applied to an XML file read from disk.

I wonder if your problem is algorithmic: the algorithm you wrote doesn't look like it does what you want. Each loop overwrites its variable on every pass, so once the loops finish, idPOS, idxml, and POSval hold only the values from the last <w> element of their file, and w refers to the last <w> element visited. Since the last loop runs over rootPOS, that element belongs to the POS file, not to the tree you write out. At best the single set call can change one word, the last one. If you're going to be setting part-of-speech attributes in bulk, perhaps you want something more like the following (you may need to tweak it if I've made some wrong assumptions about how POSfile is structured):

# load all "pos" attributes into a dictionary for fast lookup
posDict = {}
for w in rootPOS.iter("w"):
    if w.get("pos") is not None:
        posDict[w.text] = w.get("pos")

# if we see any matching words in the xmlfile, set their "pos" attribute
for w in root.iter("w"):
    if w.text in posDict:
        w.set("pos", posDict[w.text])
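
If the same word form can receive different tags in different places (for example "house" as a noun and as a verb), a dictionary keyed on the word text keeps only one of them. A minimal sketch of the id-based matching described in the question could look like the following. The sample strings and the names pos_by_id, root_pos and result are made up for illustration, and it assumes both files carry the same id values on their <w> elements:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: both documents use the same "id" values.
pos_xml = '<s><w id="1" pos="DET">The</w><w id="2" pos="N">house</w></s>'
text_xml = '<text><s><w id="1">The</w><w id="2">house</w></s></text>'

root_pos = ET.fromstring(pos_xml)
root = ET.fromstring(text_xml)

# Map each id to its POS tag, skipping elements that lack either attribute.
pos_by_id = {
    w.get("id"): w.get("pos")
    for w in root_pos.iter("w")
    if w.get("id") is not None and w.get("pos") is not None
}

# Copy the tag onto every <w> in the target tree whose id is known.
for w in root.iter("w"):
    tag = pos_by_id.get(w.get("id"))
    if tag is not None:
        w.set("pos", tag)

result = ET.tostring(root, encoding="unicode")
print(result)
```

With real files you would use ET.parse(...) and tree.write(...) instead of the inline strings, as in the question's function.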

OTHER TIPS

I've performed the tagging, but I need to write the output into the xml file. The tagger output looks like this:

The/DET house/N is/V big/ADJ ./PUNC

The xml file from which the text came will look like this:

<s>
 <w>The</w>
 <w>house</w>
 <w>is</w>
 <w>big</w>
 <w>.</w>
</s>

Now I would like to add the pos-tags as attribute-value pairs to the xml elements:

<s>
 <w pos="DET">The</w>
 <w pos="N">house</w>
 <w pos="V">is</w>
 <w pos="ADJ">big</w>
 <w pos="PUNC">.</w>
</s>

I hope this sample in English makes it clear (I'm actually working on historical Welsh).

I have now managed to do something like this with ElementTree:

import sys
import os
import re
import tree

def xmldump(file_name, xmldump):

    """
    Function takes two arguments (file_name, xmldump). It reads the
    tagger output from file_name and builds a list containing
    (for every sentence) a list of word-pos pairs.
    It then converts this output to xml and writes it to xmldump.
    """

    with open(file_name) as f:
        text = ' '.join(f.readlines())

    # split the text into sentences
    sentences = re.split(r"\./PUNC", text)

    xmlcorpus = []

    # convert sentences to xml
    for s in sentences:
        t = tree.xml(s)
        xmlcorpus.append(t)

    # write xmlcorpus to new file
    with open(xmldump, 'w') as f:
        for sent in xmlcorpus:
            f.write(sent)

    return xmldump

This sort of works, although there are now 'chink' and 'chunk' elements automatically generated by the ElementTree 'tree' module that I can't get rid of somehow.
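
One way to avoid the intermediate conversion altogether might be to pair the tagger output with the existing <w> elements by position. The sketch below uses the English sample from above; parse_tagged and add_pos_in_order are hypothetical helper names, and it assumes that every token has the form word/TAG and that the tagger output contains exactly the same tokens, in the same order, as the xml file:

```python
import xml.etree.ElementTree as ET

# Sample data taken from the example in this post.
tagged = "The/DET house/N is/V big/ADJ ./PUNC"
xml_text = "<s><w>The</w><w>house</w><w>is</w><w>big</w><w>.</w></s>"

def parse_tagged(line):
    """Split whitespace-separated 'word/TAG' tokens into (word, tag) pairs."""
    # rsplit on the last slash so a word that itself contains '/' stays intact
    return [tuple(tok.rsplit("/", 1)) for tok in line.split()]

def add_pos_in_order(root, pairs):
    """Set 'pos' on each <w>, pairing elements and tags by position."""
    words = list(root.iter("w"))
    if len(words) != len(pairs):
        raise ValueError("token count differs between xml and tagger output")
    for w, (word, tag) in zip(words, pairs):
        # Guard against the two sources drifting out of alignment.
        if (w.text or "").strip() != word:
            raise ValueError(f"mismatch: {w.text!r} vs {word!r}")
        w.set("pos", tag)
    return root

root = add_pos_in_order(ET.fromstring(xml_text), parse_tagged(tagged))
result = ET.tostring(root, encoding="unicode")
print(result)
```

Because only the attributes of the existing <w> elements are touched, any other editorial markup in the file is left as it was, and no extra elements are generated.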

Licensed under: CC-BY-SA with attribution