C++ Arabica (over Xerces-c) getNodeValue() method does not return the actual value

https://stackoverflow.com/questions/14071736

12-12-2021

Question

I am using Arabica, which wraps Xerces-c, to parse XML. The sample code below returns the correct names from getNodeName(), but getNodeValue() does not return the expected values:

bool readXML(bfs::path xmlfullfile) 
{
  // first check to see if the file exists
  if (!bfs::is_regular_file(xmlfullfile)) return false;

  Arabica::SAX2DOM::Parser<std::string> domParser;
  Arabica::SAX::CatchErrorHandler<std::string> eh;
  Arabica::DOM::Document<std::string> xmlDoc; 
  Arabica::SAX::InputSource<std::string> is;

  domParser.setErrorHandler(eh);
  is.setSystemId(xmlfullfile.string());
  domParser.parse(is);

  if(!eh.errorsReported()) 
  {
    xmlDoc = domParser.getDocument();
    xmlDoc.normalize();

    Arabica::DOM::NodeList<std::string> objects = xmlDoc.getElementsByTagName("object");
    for (size_t t = 0; t < objects.getLength(); t++) 
    {
      Arabica::DOM::Node<std::string> object = objects.item(t);
      Arabica::DOM::NodeList<std::string> values = object.getChildNodes(); 
      for (size_t u = 0; u < values.getLength(); u++) 
      {
        values.item(u).normalize(); 
        string name = values.item(u).getNodeName(); 
        string val = values.item(u).getNodeValue(); 
        cout << "Node streaming = \"" << values.item(u) << "\", meaning that name = \"" << name << "\" and value = \"" << val << "\"" << endl; 
      }
    }
    return true;
  } else {
    std::cerr << eh.errors() << std::endl;
    eh.reset();
    return false;
  }
}

The sample XML I'm trying to parse is:

<annotation>
    <filename>1a.jpg</filename>
    <folder>Sample</folder>
    <source>
        <database>Some database</database>
        <annotation>Annotator</annotation>
        <image>Some source</image>
    </source>
    <size>
        <width>3264</width>
        <height>1840</height>
        <depth>0</depth>
    </size>
    <segmented>0</segmented>
    <object>
        <name>somename</name>
        <pose>Unspecified</pose>
        <truncated>0</truncated>
        <difficult>0</difficult>
        <occluded>0</occluded>
        <bndbox>
            <xmin>48</xmin>
            <ymin>671</ymin>
            <xmax>3213</xmax>
            <ymax>1616</ymax>
        </bndbox>
    </object>
</annotation>

The output looks similar to this:

Node streaming = "
                ", meaning that name = "#text" and value = "
                "
Node streaming = "<name>somename</name>", meaning that name = "name" and value = ""
Node streaming = "
                ", meaning that name = "#text" and value = "
                "
Node streaming = "<pose>Unspecified</pose>", meaning that name = "pose" and value = ""
Node streaming = "
                ", meaning that name = "#text" and value = "
                "
Node streaming = "<truncated>0</truncated>", meaning that name = "truncated" and value = ""
Node streaming = "
                ", meaning that name = "#text" and value = "
                "
Node streaming = "<difficult>0</difficult>", meaning that name = "difficult" and value = ""
Node streaming = "
                ", meaning that name = "#text" and value = "
                "
Node streaming = "<occluded>0</occluded>", meaning that name = "occluded" and value = ""
Node streaming = "
                ", meaning that name = "#text" and value = "
                "
Node streaming = "<bndbox>
                        <xmin>48</xmin>
                        <ymin>671</ymin>
                        <xmax>3213</xmax>
                        <ymax>1616</ymax>
                </bndbox>", meaning that name = "bndbox" and value = ""
Node streaming = "
        ", meaning that name = "#text" and value = "
        "

I'm not sure what I'm doing wrong. getNodeName() returns the correct name (apart from the #text nodes, of course), so it is puzzling that getNodeValue() returns nothing for the element nodes.

Solution 2

I found a solution after comparing my code with code written for some other XML libraries. The text of an element is not stored as the element's own value, which is why getNodeValue() on an element comes back empty. The text lives in a child text node, so one has to take the first child of a simple leaf element to reach it. I'm not sure this is the best way to do it, but here is the code in case someone else has the same problem:

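for (size_t u = 0; u < values.getLength(); u++)
{
  string name = values.item(u).getNodeName();
  // skip the text nodes that sit between the elements (whitespace-only in this sample)
  if (name == "#text") continue;
  // the element's text is the value of its first child, which is a text node
  string val = values.item(u).getFirstChild().getNodeValue();
  cout << "Node streaming = \"" << values.item(u) << "\", meaning that name = \"" << name << "\" and value = \"" << val << "\"" << endl;
}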

Note: production code should take into account that not all nodes are simple leaf nodes (bndbox in the sample, for example, has element children), so my code is only half of the solution.
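To illustrate the point about non-leaf nodes, here is a minimal sketch that uses a toy node type rather than Arabica itself. ToyNode, makeText, makeElement and textContent are hypothetical names invented for this example and are not part of Arabica or Xerces-c. It only models the idea that an element carries no value of its own and that its text has to be collected from the text nodes below it:

```cpp
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for a DOM node (NOT Arabica's API): an element has a name and
// children but no value of its own; a text node has a value and no children.
struct ToyNode {
    enum class Kind { Element, Text };
    Kind kind;
    std::string name;   // tag name for elements, "#text" for text nodes
    std::string value;  // character data for text nodes, empty for elements
    std::vector<ToyNode> children;
};

// Build a text node holding `data`.
ToyNode makeText(const std::string& data) {
    return ToyNode{ToyNode::Kind::Text, "#text", data, {}};
}

// Build an element node with the given children.
ToyNode makeElement(const std::string& name, std::vector<ToyNode> children) {
    return ToyNode{ToyNode::Kind::Element, name, "", std::move(children)};
}

// Concatenate the character data of every text node at or below `node`,
// in document order. Works for leaf elements and nested elements alike.
std::string textContent(const ToyNode& node) {
    if (node.kind == ToyNode::Kind::Text) return node.value;
    std::string out;
    for (const ToyNode& child : node.children) out += textContent(child);
    return out;
}
```

For a leaf such as makeElement("name", {makeText("somename")}), the value field is empty while textContent gives back the text of the child. For a nested element such as bndbox, the result also includes the whitespace between the child elements, so real code would probably walk the children individually instead.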

Other hints

You are counting the whitespace-only text nodes as well. Adding a DTD that doesn't allow text nodes at that place could help. A non-validating parser has to report all the whitespace nodes, and is not allowed to make assumptions about what is ignorable and what is not.

Bottom line: if you want to get rid of the whitespace text nodes, you'll have to program that yourself in your DOM program.
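One way to program that yourself is a small predicate that recognises whitespace-only strings. This is only a sketch using the standard library; isWhitespaceOnly is a hypothetical helper name, not something Arabica provides:

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// True when `text` is empty or consists only of whitespace characters
// (spaces, tabs, newlines, and so on).
bool isWhitespaceOnly(const std::string& text) {
    return std::all_of(text.begin(), text.end(), [](unsigned char c) {
        return std::isspace(c) != 0;
    });
}
```

In the loop from the question, a node could then be skipped when getNodeName() returns "#text" and isWhitespaceOnly(getNodeValue()) is true, which keeps text nodes that carry real content.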

License: CC-BY-SA with attribution

Not affiliated with StackOverflow