In the end I did it without a parser or a tree and just used regular expressions. This is the code, which works nicely and is fast:
import re

NER = ['TIME','LOCATION','ORGANISATION','PERSON','MONEY','PERCENT','DATA']
entities = {}
for cat in NER:
    # Non-greedy match of everything between <CAT> and </CAT>
    regex_cat = re.compile('<'+cat+'>(.*?)</'+cat+'>')
    # One list of matched strings per category
    entities[cat] = regex_cat.findall(data)
Here data is just a string of text. The code uses regular expressions to find all entities of each category listed in NER and stores them as a list in a dictionary, keyed by category. This can be used for any inlineXML string, where NER is a list of all possible tags in the string.
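For illustration, here is a minimal self-contained sketch of the same idea wrapped in a function. The sample string and the function name extract_entities are made up for this example, and re.escape and re.DOTALL are small additions of mine (so that tag names are matched literally and an entity may span a line break); they are not part of the code above.

```python
import re

# Hypothetical sample input; real tagger output may look different.
data = ("<PERSON>Ada Lovelace</PERSON> joined <ORGANISATION>Acme</ORGANISATION> "
        "in <LOCATION>London</LOCATION> and later met <PERSON>Alan Turing</PERSON>.")

NER = ['TIME', 'LOCATION', 'ORGANISATION', 'PERSON', 'MONEY', 'PERCENT', 'DATA']

def extract_entities(text, tags):
    """Return a dict mapping each tag to the list of strings wrapped in it."""
    entities = {}
    for cat in tags:
        tag = re.escape(cat)  # assumption: treat the tag name as literal text
        # Non-greedy group so each match stops at the first closing tag;
        # re.DOTALL (an addition) lets an entity contain a newline.
        pattern = re.compile('<' + tag + '>(.*?)</' + tag + '>', re.DOTALL)
        entities[cat] = pattern.findall(text)
    return entities

entities = extract_entities(data, NER)
print(entities['PERSON'])    # ['Ada Lovelace', 'Alan Turing']
print(entities['LOCATION'])  # ['London']
print(entities['TIME'])      # []
```

Categories that do not occur in the string simply map to an empty list, so the dictionary always has one key per entry in NER.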