libxml split text nodes at spaces

https://stackoverflow.com/questions/20624615

02-09-2022

Question

I am using libxml's HTML parser to build a DOM tree of HTML documents. libxml gives the text content of each node as one monolithic string (node), but I need to split each text node further at spaces and create one node per word. So far I haven't found any libxml option for this, so I wrote some CPU-expensive logic to split text nodes. Below is the part of the recursive method that works.

void parse(xmlNodePtr cur, El*& parent) {

  if (!cur) {
    return;
  }

  string tagName = (const char*) cur->name;
  string content = node_text(cur); // function defined below

  Element* el = new Element(tagName, content);
  parent->childs.push_back(el);


  size_t pos;
  string text;
  cur = cur->children;
  while (cur != NULL) {
    if (xmlNodeIsText(cur) && (pos = node_text_find(cur, text, " ")) != string::npos) {
      string first = text.substr(0, pos);
      string second = text.substr(pos + 1);

      El *el1 = new Element("text", first);
      el->childs.push_back(el1);

      El *el2 = new Element("text", " ");
      el->childs.push_back(el2);

      xmlNodeSetContent(cur, BAD_CAST second.c_str());
      continue;
    }
    parse(cur, el);
    cur = cur->next;
  }
}

string node_text(xmlNodePtr cur) {
    string content;
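    if (xmlNodeIsText(cur)) {
        // xmlNodeGetContent returns a newly allocated copy that the caller must free.
        xmlChar *buf = xmlNodeGetContent(cur);
        if (buf) {
            content = (const char*) buf;
            xmlFree(buf);
        }
    }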
    return content;
}

size_t node_text_find(xmlNodePtr cur, string& text, string what){
    text = node_text(cur);
    return text.find_first_of(what);
}

The problem with the above code is that it doesn't work for some UTF strings, such as Chinese text, and it also adds time to the overall parsing process.

Can anyone suggest a better way of doing this? Thank you in advance!

Solution

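I don't have a complete answer, but I did see you doing explicit casts of xmlChar to char. That is a bad sign and probably related to why it doesn't work on Unicode: libxml2 stores strings as UTF-8 in xmlChar, so the cast keeps the bytes intact, but every std::string operation afterwards works on bytes rather than on characters.

If you're dealing with Unicode, which xmlChar is (as UTF-8), you need Unicode-aware text processing, not plain byte-oriented std::string operations.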

You actually have two choices. Find a library which processes in UTF-8 or convert UTF-8 into wchar (wide characters). If you convert to wchar then you can use wstring and its functions to process Unicode.

The Stack Overflow question "libxml2 xmlChar * to std::wstring" looks like a useful answer.
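For the first choice (processing UTF-8 directly), here is a minimal sketch that decodes code points by hand and splits on a small set of whitespace characters. The names next_code_point, is_space and split_words are hypothetical helpers, not libxml2 functions, and the whitespace set is an assumption. It includes U+3000 (ideographic space) because Chinese text often uses that instead of the ASCII space, which may be part of why a search for " " finds nothing to split on. The decoder assumes well-formed UTF-8 and does no validation.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: decode one code point starting at s[i] and advance i.
// Assumes well-formed UTF-8; stray bytes are passed through one at a time.
char32_t next_code_point(const std::string& s, std::size_t& i) {
    unsigned char b0 = static_cast<unsigned char>(s[i]);
    std::size_t len = 1;
    char32_t cp = b0;
    if (b0 >= 0xF0)      { len = 4; cp = b0 & 0x07; }
    else if (b0 >= 0xE0) { len = 3; cp = b0 & 0x0F; }
    else if (b0 >= 0xC0) { len = 2; cp = b0 & 0x1F; }
    for (std::size_t k = 1; k < len && i + k < s.size(); ++k) {
        cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
    }
    i = std::min(i + len, s.size());  // clamp in case the input is truncated
    return cp;
}

// Assumed whitespace set: ASCII space, tab, newline, CR, ideographic space.
bool is_space(char32_t cp) {
    return cp == 0x20 || cp == 0x09 || cp == 0x0A || cp == 0x0D || cp == 0x3000;
}

// Split UTF-8 text into words in one pass; whitespace is dropped.
std::vector<std::string> split_words(const std::string& s) {
    std::vector<std::string> words;
    std::size_t start = 0, i = 0;
    while (i < s.size()) {
        std::size_t cp_begin = i;
        char32_t cp = next_code_point(s, i);
        if (is_space(cp)) {
            if (cp_begin > start) words.emplace_back(s, start, cp_begin - start);
            start = i;  // next word begins after the whitespace character
        }
    }
    if (start < s.size()) words.emplace_back(s, start, s.size() - start);
    return words;
}
```

Languages written without spaces between words would still come out as one long "word" here; real word segmentation for those needs a dedicated Unicode library.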

As for speed, do my eyes deceive me or are you splitting on one space and creating a new element which you then split again? This is not the way to performance. I think it would go better if you remove the text node, split all of the words out and add the new nodes as you go.
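A minimal sketch of that single-pass idea follows. The Element type here is a stand-in (the question does not show its definition), append_words is a hypothetical name, and the children are held in std::unique_ptr rather than the raw pointers used in the question. Unlike the original loop, this version skips empty words instead of creating empty elements.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Stand-in for the question's Element class (assumed layout).
struct Element {
    std::string tag;
    std::string content;
    std::vector<std::unique_ptr<Element>> childs;
    Element(std::string t, std::string c)
        : tag(std::move(t)), content(std::move(c)) {}
};

// Walk the text once and append one child per word and one per space.
void append_words(Element& parent, const std::string& text) {
    std::size_t start = 0;
    while (start < text.size()) {
        std::size_t pos = text.find(' ', start);
        if (pos == std::string::npos) pos = text.size();
        if (pos > start) {
            parent.childs.push_back(
                std::make_unique<Element>("text", text.substr(start, pos - start)));
        }
        if (pos < text.size()) {
            parent.childs.push_back(std::make_unique<Element>("text", " "));
        }
        start = pos + 1;  // continue scanning after the space
    }
}
```

In parse, something like this would be called once per text node with the result of node_text(cur), so the libxml node never has to be rewritten with xmlNodeSetContent.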

The slowdown is most likely in the repeated creation, copying and destruction of objects. Work to minimize that. For example, if Element had a constructor form that accepted a begin/end iterator pair, or a start, length pair, that would be more efficient than creating a substring (copy!) and creating an Element (copy!) and then destroying the substrings.
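As a rough illustration of the iterator-pair constructor, here is a sketch assuming C++17 and a simplified stand-in Element (the real class is not shown in the question). split_views is a hypothetical helper that returns std::string_view ranges into the original text, so no substring is copied until an Element builds its own content.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Simplified stand-in for the question's Element (assumed layout).
struct Element {
    std::string tag;
    std::string content;

    // Build the content straight from a character range: one copy, no temporary string.
    template <class It>
    Element(const char* t, It first, It last) : tag(t), content(first, last) {}
};

// Return views of the space-separated words; the views point into `text`,
// so `text` must outlive them.
std::vector<std::string_view> split_views(std::string_view text) {
    std::vector<std::string_view> out;
    std::size_t start = 0;
    while (start <= text.size()) {
        std::size_t pos = text.find(' ', start);
        if (pos == std::string_view::npos) pos = text.size();
        if (pos > start) out.push_back(text.substr(start, pos - start));
        start = pos + 1;
    }
    return out;
}
```

Each view can then be handed to the constructor as Element("text", v.begin(), v.end()), which copies the characters exactly once.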

Repeatedly calling xmlNodeSetContent with the (probably large) second half of the text string gives you O(n²) performance, where n is the length of the text. Not good.
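To get a feel for the quadratic growth, the following sketch simulates only the "copy the remainder after each space" part of the loop and counts the bytes copied. remainder_bytes_copied is a hypothetical name, and the count ignores the extra copies made by node_text and xmlNodeSetContent in the real code, so the actual cost would be higher.

```cpp
#include <cstddef>
#include <string>

// Simulate splitting at the first space and keeping the remainder,
// as the question's loop does, and total the bytes copied into `second`.
std::size_t remainder_bytes_copied(std::string text) {
    std::size_t copied = 0;
    std::size_t pos;
    while ((pos = text.find(' ')) != std::string::npos) {
        std::string second = text.substr(pos + 1);  // copy of everything after the space
        copied += second.size();
        text = second;
    }
    return copied;
}
```

For 1000 one-letter words separated by single spaces (1999 bytes), the remainders are 1997, 1995, ..., 1 bytes long, which sums to 999² = 998001 bytes copied, compared with 1999 bytes read in a single pass.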

Licensed under: CC-BY-SA with attribution
