Question

First time poster. I'll try to be as specific as possible. To narrow questions down, I have no control over what the xml document looks like (I have to make the parser work with the document as is). The file is well formed (there's nothing telling me the document is not well formed and I see no reason as to why it wouldn't be). I'm not getting any errors back from the program (or exceptions from the parser). Anyway...

I'm feeding an XML file (UTF-8 encoded) into the SAX parser and pulling out the text between the tags I need (and attributes where needed). The document has a lot of nested tags, some of which share the same name. To make sure I'm in the part of the document that holds the information I need, I use a series of flags: set when I see a start tag, reset when I see the matching end tag. If the right flags are set, the characters method of the content handler appends the text to a list held within an object. I don't modify the text in any way; I then write the contents of the object to a file.

When it reads the content, the SAX parser replaces escaped characters. So this:

<name>D &amp; C YELLOW NO. 10</name>

should become this:

D & C YELLOW NO. 10

But in the file and when content is printed to the console (in the content handler's characters function), the string reads as:

D 

That D is followed by a space in the file and in the console printing. My question is, is this some sort of bug or is there something I'm missing?

EDIT: Relevant code provided. xmlFile is just a string holding a file name (e.g. test.xml).

XMLContentHandler=NIHXMLparser.XMLContentHandler()
xml.sax.parse(xmlFile,XMLContentHandler)

As I'm not modifying the content of the file in any way and just pulling it, I'll provide the skeleton of the parser.

import xml.sax

class XMLContentHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        #initializing some flags to false
    def startElement(self, name, attrs):
        #set flags according to what tag
        #names appear.
        pass
    def characters(self,content):
        #depending on certain flags being set
        #I just pull out the info between there.
        #No modifications made. The sax parser
        #parses the content variable on its own.
        #I have no control over what it sends back.
        pass
    def endElement(self,name):
        #resets flags here.
        pass

Solution

Yes, you are missing something. From the xml.sax.ContentHandler.characters documentation:

The Parser will call this method to report each chunk of character data. SAX parsers may return all contiguous character data in a single chunk, or they may split it into several chunks ...
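
To see the splitting described in that passage, here is a minimal sketch (Python 3) that records every chunk handed to characters(). ChunkRecorder and record_chunks are hypothetical names made up for this illustration. How the text is split depends on the parser, so the only thing to rely on is that the chunks join back into the full text.

```python
import io
import xml.sax


class ChunkRecorder(xml.sax.ContentHandler):
    """Hypothetical handler that records every chunk passed to characters()."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        # Each call may carry only part of an element's text.
        self.chunks.append(content)


def record_chunks(xml_text):
    """Parse xml_text and return the list of chunks the parser reported."""
    handler = ChunkRecorder()
    xml.sax.parse(io.BytesIO(xml_text.encode("utf-8")), handler)
    return handler.chunks


chunks = record_chunks("<name>D &amp; C YELLOW NO. 10</name>")
print(chunks)           # how the text was split (parser-dependent)
print("".join(chunks))  # D & C YELLOW NO. 10
```

If the first list shows more than one entry, a handler that keeps or prints only one call's worth of text will lose the rest, which would match the "D " you are seeing.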


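An entity reference such as &amp; is a likely place for the parser to split the text, which would explain why you only see "D ". You might try collecting the text in .characters() and emitting it in endElement, like so: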

#!/usr/bin/env python3

import io
import xml.sax

class NIHXMLparser:
  class XMLContentHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.name = False   # True while inside a <name> element
        self.content = ''   # text collected for the current <name>
    def startElement(self, name, attrs):
        if name == 'name':
            self.name = True
    def characters(self, content):
        # May be called several times per element, so accumulate
        # (and only while inside <name>).
        if self.name:
            self.content += content
    def endElement(self, name):
        if self.name and name == 'name':
            self.name = False
            print(self.content)
            self.content = ''

xmlText = '<name>D &amp; C YELLOW NO. 10</name>'
xmlFile = io.BytesIO(xmlText.encode('utf-8'))

XMLContentHandler = NIHXMLparser.XMLContentHandler()
xml.sax.parse(xmlFile, XMLContentHandler)
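
Since the question mentions nested tags that share a name, the same accumulate-then-emit idea can be combined with a stack of open element names instead of individual flags. This is only a sketch under assumptions: PathTextHandler is a hypothetical class, and the element names product and ingredient are invented for the example, not taken from the real document.

```python
import io
import xml.sax


class PathTextHandler(xml.sax.ContentHandler):
    """Hypothetical handler: collects the text of elements at a given path."""

    def __init__(self, wanted_path):
        super().__init__()
        self.wanted_path = list(wanted_path)  # e.g. ["ingredient", "name"]
        self.path = []      # names of the currently open elements
        self.buffer = []    # chunks of the element being collected
        self.results = []   # complete text of each matching element

    def _at_target(self):
        # True when the innermost open elements match wanted_path.
        return self.path[-len(self.wanted_path):] == self.wanted_path

    def startElement(self, name, attrs):
        self.path.append(name)
        if self._at_target():
            self.buffer = []

    def characters(self, content):
        # Accumulate; one element's text may arrive in several calls.
        if self._at_target():
            self.buffer.append(content)

    def endElement(self, name):
        if self._at_target():
            self.results.append("".join(self.buffer))
        self.path.pop()


# Invented sample document: two <name> elements at different depths.
doc = (
    "<product>"
    "<name>Example Product</name>"
    "<ingredient><name>D &amp; C YELLOW NO. 10</name></ingredient>"
    "</product>"
)
handler = PathTextHandler(["ingredient", "name"])
xml.sax.parse(io.BytesIO(doc.encode("utf-8")), handler)
print(handler.results)  # ['D & C YELLOW NO. 10']
```

Joining a list of chunks once in endElement also avoids repeated string concatenation, though for small elements the difference is unlikely to matter.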
Licensed under: CC-BY-SA with attribution