Question

I have a string like the following.

<GPE>LUSAKA</GPE> (<ORG>AP</ORG>) -- X&Y Ltd. &amp; M.K. Ltd will be merged.

How can I make it valid XML so that my etree.XMLParser does not throw an error? I need to convert it to something like this:

<GPE>LUSAKA</GPE> (<ORG>AP</ORG>) -- X&amp;Y Ltd. &amp; M.K. Ltd will be merged.

To do this I tried tidylib, but it removed all the custom tags. Here is the code:

import tidylib

# data holds the input string shown above
options = {
    'wrap': 0,
    'indent': 0,
    'output-xhtml': 1,
    'numeric-entities': 1
}
html, warnings = tidylib.tidy_fragment(data, options)

The output is:

LUSAKA (AP) -- X&amp;Y Ltd. &amp; M.K. Ltd will be merged.

Solution

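Parse the string with lxml's lenient HTMLParser instead of the XML parser. It recovers from the bare ampersand, but note that it lowercases the tag names and wraps the fragment in html and body elements.

>>> from lxml import etree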
>>> tree = etree.fromstring('<GPE>LUSAKA</GPE> (<ORG>AP</ORG>) -- X&Y Ltd. &amp; M.K. Ltd will be merged.', etree.HTMLParser())
>>> etree.tostring(tree)
'<html><body><gpe>LUSAKA</gpe> (<org>AP</org>) -- X&amp;Y Ltd. &amp; M.K. Ltd will be merged.</body></html>'
>>> tree.xpath('//gpe/text()')
['LUSAKA']
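
If the original tag case and the exact output shown in the question matter, another option is to escape only the bare ampersands before parsing. The following is a minimal sketch, not part of the answer above: the helper name escape_bare_ampersands and the wrapping root element are made up for illustration, and it uses the standard library's xml.etree.ElementTree rather than lxml so that it runs without extra dependencies.

```python
import re
from xml.etree import ElementTree as ET

# Matches "&" only when it does NOT start an entity or character reference
# such as "&amp;", "&#38;" or "&#x26;".
BARE_AMP = re.compile(r'&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)')


def escape_bare_ampersands(text):
    """Hypothetical helper: replace unescaped '&' with '&amp;'."""
    return BARE_AMP.sub('&amp;', text)


raw = '<GPE>LUSAKA</GPE> (<ORG>AP</ORG>) -- X&Y Ltd. &amp; M.K. Ltd will be merged.'
fixed = escape_bare_ampersands(raw)
print(fixed)

# The fragment has no single root element, so wrap it in a made-up <root>
# before handing it to an XML parser.
root = ET.fromstring('<root>' + fixed + '</root>')
print(root.find('GPE').text)
print(root.find('ORG').text)
```

This keeps the GPE and ORG tags in their original case. It only handles stray ampersands: a named entity that XML does not predefine (for example &nbsp;) is left alone by the regular expression and would still be rejected by a strict XML parser, and a stray "<" is not handled at all.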
Licensed under: CC-BY-SA with attribution