Add POS tags as attribute to xml element

Question 1

The set method is the appropriate way to set attributes in ElementTree, and I just tested that it works when applied to an XML file read from disk.
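For reference, here is a minimal sketch of that check. The sample sentence is made up for illustration and is parsed from a string; ET.parse on a file on disk should behave the same way.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample input; ET.parse("file.xml").getroot() gives the same kind of object.
root = ET.fromstring("<s><w>The</w><w>house</w></s>")

# Element.set(key, value) adds the attribute, or overwrites it if it already exists.
first_word = root.find("w")
first_word.set("pos", "DET")

# Serialize to a string to confirm the attribute is really on the element.
result = ET.tostring(root, encoding="unicode")
print(result)
```

Note that the change only exists in memory until the tree is written back out, for example with the write method of the ElementTree object returned by ET.parse.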

I wonder if your problem is algorithmic: the algorithm you wrote doesn't look like it does what you want. Once the loops have finished, idPOS, idxml, and POSval hold the last matching values from each file, and w refers to the last <w> tag, so the code can change only one word, the last one. If you're going to be setting part-of-speech attributes in bulk, perhaps you want something more like the following (you may need to tweak it if I've made some wrong assumptions about how POSfile is structured):

# load all "pos" attributes into a dictionary for fast lookup
posDict = {}
for w in rootPOS.iter("w"):
    if w.get("pos") is not None:
        posDict[w.text] = w.get("pos")

# if we see any matching words in the xmlfile, set their "pos" attribute
for w in root.iter("w"):
    if w.text in posDict:
        w.set("pos", posDict[w.text])
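One limitation of a lookup keyed on the word text is that a word form with more than one possible tag keeps only the tag seen last. If both files contain the same words in the same order (an assumption on my part, not something the question confirms), pairing the <w> elements by position avoids that. A rough sketch, with made-up sample data standing in for rootPOS and root:

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-ins for the tagged file (rootPOS) and the untagged file (root).
rootPOS = ET.fromstring('<s><w pos="N">saw</w><w pos="V">saw</w></s>')
root = ET.fromstring("<s><w>saw</w><w>saw</w></s>")

tagged = list(rootPOS.iter("w"))
untagged = list(root.iter("w"))

# Positional pairing only makes sense if both files have the same number of words.
if len(tagged) != len(untagged):
    raise ValueError("files do not contain the same number of <w> elements")

for src, dst in zip(tagged, untagged):
    pos = src.get("pos")
    # Copy the tag only when one is present and the two words really line up.
    if pos is not None and src.text == dst.text:
        dst.set("pos", pos)

copied = [w.get("pos") for w in root.iter("w")]
print(copied)
```

Here the two occurrences of "saw" keep their separate tags, which a dictionary keyed on w.text could not do.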

Question 2

I've performed the tagging, but I need to write the output into the xml file. The tagger output looks like this:

The/DET house/N is/V big/ADJ ./PUNC

The xml file from which the text came will look like this:

<s>
 <w>The</w>
 <w>house</w>
 <w>is</w>
 <w>big</w>
 <w>.</w>
</s>

Now I would like to add the pos-tags as attribute-value pairs to the xml elements:

<s>
 <w pos="DET">The</w>
 <w pos="N">house</w>
 <w pos="V">is</w>
 <w pos="ADJ">big</w>
 <w pos="PUNC">.</w>
</s>

I hope this sample in English makes it clear (I'm actually working on historical Welsh).
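The mapping described above could be sketched roughly as follows with the standard library's xml.etree.ElementTree. This assumes every token in the tagger output has the form word/TAG and that tokens and <w> elements appear in the same order; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

tagger_output = "The/DET house/N is/V big/ADJ ./PUNC"
sentence = ET.fromstring(
    "<s><w>The</w><w>house</w><w>is</w><w>big</w><w>.</w></s>"
)

# Split on the LAST "/" so a word that itself contains "/" keeps its text intact.
pairs = [token.rsplit("/", 1) for token in tagger_output.split()]

words = sentence.findall("w")
if len(words) != len(pairs):
    raise ValueError("tagger output and xml sentence differ in length")

for w, (text, pos) in zip(words, pairs):
    # Refuse to tag if the two sources have drifted out of alignment.
    if w.text != text:
        raise ValueError("mismatch: %r vs %r" % (w.text, text))
    w.set("pos", pos)

tagged_xml = ET.tostring(sentence, encoding="unicode")
print(tagged_xml)
```

Raising on a mismatch is a deliberate choice in this sketch: a silent misalignment would attach tags to the wrong words for the rest of the sentence.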

Question 3

I have now managed to do something like this with ElementTree:

import sys
import os
import re
import tree

def xmldump(file_name, xmldump):
    """
    Function takes two arguments: file_name (the tagger output to read)
    and xmldump (the path of the xml file to write).
    It splits the text into sentences of word-pos pairs, converts every
    sentence to xml, writes the result to xmldump and returns that path.
    """

    # read the whole tagger output, joining the lines with a space
    with open(file_name) as f:
        text = ' '.join(f.readlines())

    # split the text into sentences (raw string keeps the regex escapes intact)
    sentences = re.split(r"\.\/PUNC", text)

    xmlcorpus = []

    # convert sentences to xml
    for s in sentences:
        t = tree.xml(s)
        xmlcorpus.append(t)

    # write xmlcorpus to new file
    with open(xmldump, 'w') as f:
        for sent in xmlcorpus:
            f.write(sent)

    return xmldump

This sort of works, although there are now 'chink' and 'chunk' elements automatically generated by the ElementTree 'tree' module that I can't get rid of somehow.