Python 3.3: Process inlineXML

https://stackoverflow.com/questions/15701906

30-03-2022

Question

While trying to tag named entities with the Stanford NER tool, I get this kind of output:

A jury in <ORGANIZATION>Marion County Superior Court</ORGANIZATION> was expected to begin deliberations in the case on <DATE>Wednesday</DATE> or <DATE>Thursday</DATE>.

Of course processing any XML without a root does not work, so I added this:

<root>A jury in <ORGANIZATION>Marion County Superior Court</ORGANIZATION> was expected to begin deliberations in the case on <DATE>Wednesday</DATE> or <DATE>Thursday</DATE>.</root>

I tried building a tree with the method from "Stripping inline tags with Python's lxml", but it does not work. It raises this error on the line tree = etree.fromstring(text):

lxml.etree.XMLSyntaxError: xmlParseEntityRef: no name, line 1, column 1793

Does anyone know a solution for this? Or perhaps another method which allows me to build a tree from any text with inlineXML tags, keeping only the tagged tokens and removing/ignoring the rest of the text.
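For context, "xmlParseEntityRef: no name" is the message libxml2 gives when it meets an `&` that does not start an entity reference, so a bare ampersand in the tagged text is a likely cause. The following is only a sketch under that assumption: it escapes stray ampersands, wraps the text in a root element, and collects the tagged tokens. It uses the standard library's xml.etree.ElementTree rather than lxml to stay self-contained; the helper name `parse_inline_xml` and the sample sentence are made up for illustration, and a stray `<` in the text would still break the parse.

```python
import re
import xml.etree.ElementTree as ET

# Matches an '&' that is NOT the start of a predefined or numeric entity.
BARE_AMP = re.compile(r'&(?!(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9A-Fa-f]+);)')

def parse_inline_xml(text):
    """Return (tag, token) pairs from a string with inline XML tags.

    Hypothetical helper: assumes the only markup in `text` is the
    tagger's own flat (non-nested) entity tags.
    """
    escaped = BARE_AMP.sub('&amp;', text)
    root = ET.fromstring('<root>' + escaped + '</root>')
    # Only the tagged tokens are children; untagged text is ignored.
    return [(child.tag, child.text) for child in root]

# Made-up sample containing a bare ampersand.
sample = ('Shares of <ORGANIZATION>AT&T</ORGANIZATION> fell on '
          '<DATE>Wednesday</DATE>.')
pairs = parse_inline_xml(sample)
print(pairs)
```

Iterating over the root's direct children keeps the tagged tokens in document order and drops the surrounding text, which is what the question asks for.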

Solution

In the end I did it without a parser or a tree and just used regular expressions. This code works and is fast:

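import re

# Tag names exactly as the tagger emits them (e.g. ORGANIZATION, DATE)
NER = ['TIME', 'LOCATION', 'ORGANIZATION', 'PERSON', 'MONEY', 'PERCENT', 'DATE']
entities = {}
for cat in NER:
    # Non-greedy match of everything between <CAT> and </CAT>
    regex_cat = re.compile('<' + cat + '>(.*?)</' + cat + '>')
    entities[cat] = regex_cat.findall(data)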

Here data is just a string of text. The code uses a regular expression to find all entities of each category listed in NER and stores them as a list in a dictionary keyed by category. This works for any inlineXML string, as long as NER lists all the tags that can occur in the string.
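As a variation on the same idea, the per-category loop can be folded into a single pattern that scans the string once. This is a sketch, not the original answer's code: the function name `extract_entities` is made up, and the tag list assumes the spellings seen in the question's sample output.

```python
import re
from collections import defaultdict

# Assumed tag names, spelled as in the sample output from the question.
NER = ['TIME', 'LOCATION', 'ORGANIZATION', 'PERSON', 'MONEY', 'PERCENT', 'DATE']

# One pattern for all tags; \1 forces the closing tag to match the opening one.
TAG_RE = re.compile(r'<(' + '|'.join(NER) + r')>(.*?)</\1>', re.DOTALL)

def extract_entities(data):
    """Group tagged tokens by category in a single pass over `data`."""
    entities = defaultdict(list)
    for tag, token in TAG_RE.findall(data):
        entities[tag].append(token)
    return dict(entities)

data = ('A jury in <ORGANIZATION>Marion County Superior Court</ORGANIZATION> '
        'was expected to begin deliberations in the case on '
        '<DATE>Wednesday</DATE> or <DATE>Thursday</DATE>.')
result = extract_entities(data)
print(result)
```

Unlike the loop version, categories with no matches are simply absent from the result instead of mapping to an empty list, so use `result.get(cat, [])` when a category may be missing.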

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow