Question

I'm working on a program that parses the SGML files of the Reuters dataset. The files I found have no root node enclosing all the children; each one is just a sequence of <reuters>..</reuters> elements after the DOCTYPE declaration. As a result, parsing the file and calling getroot() returns only the first <reuters> element, not the whole list. How can I work around this without changing the input files? My code is given below:

import os
from lxml import etree as ET

dirname = 'dataset'

for filename in os.listdir(dirname):
    filepath = os.path.join(dirname, filename)

    # lxml exposes the parser class as XMLParser; recover=True tolerates malformed markup
    parser = ET.XMLParser(encoding='utf-8', recover=True)

    tree = ET.parse(filepath, parser)

    root = tree.getroot()

This root element is just the first <reuters> element, while the SGML file looks like this:

<!DOCTYPE lewis SYSTEM "lewis.dtd">
<reuters> .. </reuters>
<reuters> .. </reuters>
<reuters> .. </reuters>

What I want is to iterate over all the <reuters> elements, one at a time, and work on their contents.
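
One possible workaround, offered as a sketch rather than a confirmed solution: read the file into memory, drop the DOCTYPE line, and wrap the remainder in a made-up root element before handing it to lxml. The input files on disk stay untouched. The helper name `parse_with_synthetic_root` and the wrapper tag `collection` are hypothetical, chosen only for illustration, and the inline sample stands in for a real file, which would be read with `open(filepath, 'rb').read()`.

```python
import re
from lxml import etree as ET

# Matches a DOCTYPE declaration such as <!DOCTYPE lewis SYSTEM "lewis.dtd">
DOCTYPE_RE = re.compile(rb"<!DOCTYPE[^>]*>", re.IGNORECASE)


def parse_with_synthetic_root(data, wrapper_tag=b"collection"):
    """Wrap a rootless document in an invented root element and parse it.

    `wrapper_tag` is an arbitrary name (an assumption of this sketch);
    it only needs to differ from the tags used inside the document.
    """
    # Remove the DOCTYPE so it does not end up inside the wrapper element.
    body = DOCTYPE_RE.sub(b"", data, count=1)
    wrapped = b"<" + wrapper_tag + b">" + body + b"</" + wrapper_tag + b">"
    # Same parser options as in the question: tolerate malformed markup.
    parser = ET.XMLParser(encoding="utf-8", recover=True)
    return ET.fromstring(wrapped, parser)


# Stand-in for the contents of one dataset file.
sample = b"""<!DOCTYPE lewis SYSTEM "lewis.dtd">
<reuters><title>first</title></reuters>
<reuters><title>second</title></reuters>
<reuters><title>third</title></reuters>
"""

root = parse_with_synthetic_root(sample)

# Every <reuters> element is now a direct child of the synthetic root.
titles = [doc.findtext("title") for doc in root.iterfind("reuters")]
print(titles)  # ['first', 'second', 'third']
```

XML tag names are case-sensitive, so the name passed to `iterfind` has to match the case used in the actual files. The `encoding="utf-8"` argument is carried over from the question and may need changing if the files use a different encoding.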


Licensed under: CC-BY-SA with attribution