Question

I'm trying to read in an xml file which looks like this

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp.dtd">
<dblp>
<incollection>
<author>Jos&eacute; A. Blakeley</author>
</incollection>
</dblp>

The part that causes the problem is the

Jos&eacute; A. Blakeley

part: The parser calls its character handler twice, once with "Jos", once with " A. Blakeley". Now I understand this may be the correct behaviour if it doesn't know the eacute entity. However, this is defined in the dblp.dtd, which I have. I don't seem to be able to convince expat to use this file, though. All I can say is

import xml.parsers.expat

p = xml.parsers.expat.ParserCreate()
# tried with and without following line
p.SetParamEntityParsing(xml.parsers.expat.XML_PARAM_ENTITY_PARSING_ALWAYS)
p.UseForeignDTD(True)
# binary mode, so that expat decodes the bytes itself
f = open(dblp_file, "rb")
p.ParseFile(f)

but expat still doesn't recognize my entity. Why is there no way to tell expat which DTD to use? I've tried

  • putting the file into the same directory as the XML
  • putting the file into the program's working directory
  • replacing the reference in the xml file by an absolute path

What am I missing? Thx.


Solution

As I understand it, if you're using pyexpat directly, then you have to provide your own ExternalEntityRefHandler to fetch the external DTD and feed it to expat.

See, e.g., xml.sax.expatreader for example code (the external_entity_ref method, line 374 in Python 2.6).
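A minimal sketch of such a handler, assuming Python 3 and a DTD that does not itself pull in further external entities. The names parse_authors and load_dtd are made up for illustration; load_dtd stands for whatever you use to turn the system identifier from the DOCTYPE into the DTD's bytes. As far as I can tell, expat only asks for the external DTD once parameter entity parsing is switched on, so the sketch does that too.

```python
import xml.parsers.expat as expat


def parse_authors(xml_bytes, load_dtd):
    """Return the text of every <author> element in xml_bytes.

    load_dtd is a hypothetical callback: it receives the system identifier
    from the DOCTYPE (e.g. "dblp.dtd") and returns the DTD as bytes.
    """
    authors = []
    chunks = None  # list of text pieces while inside <author>, else None

    parser = expat.ParserCreate()
    # Ask expat to load the external DTD subset (skipped by default).
    parser.SetParamEntityParsing(expat.XML_PARAM_ENTITY_PARSING_UNLESS_STANDALONE)

    def external_entity_ref(context, base, system_id, public_id):
        # Feed the DTD to a child parser that shares the parent's entity table.
        try:
            child = parser.ExternalEntityParserCreate(context)
            child.Parse(load_dtd(system_id), True)
        except (OSError, KeyError, expat.ExpatError):
            return 0  # tell expat the entity could not be loaded
        return 1

    def start(name, attrs):
        nonlocal chunks
        if name == "author":
            chunks = []

    def chars(data):
        # May be called several times per element; collect the pieces.
        if chunks is not None:
            chunks.append(data)

    def end(name):
        nonlocal chunks
        if name == "author" and chunks is not None:
            authors.append("".join(chunks))
            chunks = None

    parser.ExternalEntityRefHandler = external_entity_ref
    parser.StartElementHandler = start
    parser.CharacterDataHandler = chars
    parser.EndElementHandler = end
    parser.Parse(xml_bytes, True)
    return authors
```

Even with the entity resolved, expat may still deliver the text in several pieces, so the sketch joins them per element rather than relying on a single character handler call. For a file on disk, load_dtd could simply open the system identifier relative to the XML file's directory and return its contents.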

It would probably be better to use a higher-level interface such as SAX (via expatreader) if you can.
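A rough sketch of the SAX route, assuming Python 3. AuthorCollector, InMemoryResolver and sax_authors are names invented for this example, and the DTD is served from a dict only to keep the sketch self-contained. Recent Python versions leave external general entities switched off by default, so the sketch enables feature_external_ges explicitly.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, EntityResolver, feature_external_ges
from xml.sax.xmlreader import InputSource


class AuthorCollector(ContentHandler):
    """Collects the text of every <author> element."""

    def __init__(self):
        super().__init__()
        self.authors = []
        self._chunks = None  # list of text pieces while inside <author>

    def startElement(self, name, attrs):
        if name == "author":
            self._chunks = []

    def characters(self, content):
        # SAX may split the text into several calls; collect the pieces.
        if self._chunks is not None:
            self._chunks.append(content)

    def endElement(self, name):
        if name == "author" and self._chunks is not None:
            self.authors.append("".join(self._chunks))
            self._chunks = None


class InMemoryResolver(EntityResolver):
    """Hypothetical resolver: maps system identifiers to DTD bytes."""

    def __init__(self, dtds):
        self._dtds = dtds

    def resolveEntity(self, publicId, systemId):
        source = InputSource(systemId)
        source.setByteStream(io.BytesIO(self._dtds[systemId]))
        return source


def sax_authors(xml_bytes, dtds):
    collector = AuthorCollector()
    reader = xml.sax.make_parser()
    # Allow the reader to fetch the external DTD through the resolver.
    reader.setFeature(feature_external_ges, True)
    reader.setContentHandler(collector)
    reader.setEntityResolver(InMemoryResolver(dtds))
    reader.parse(io.BytesIO(xml_bytes))
    return collector.authors
```

If the DTD sits next to the XML file, the default resolver may be enough once the feature is enabled and the file is parsed by path; the custom resolver here just makes the lookup explicit.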

OTHER TIPS

btw I can temporarily help myself by copying the relevant parts of the .dtd into the XML file itself, as in

<!DOCTYPE dblp [
    <!ENTITY Agrave  "&#192;" >
]>

but that doesn't really solve the problem in a general way.

Licensed under: CC-BY-SA with attribution