I don't have a complete answer, but I did see you doing explicit casts of xmlChar to char. That is a bad sign and probably why it doesn't work on Unicode.
If you're dealing with Unicode, which xmlChar is (libxml2 stores its strings as UTF-8), you need to be using Unicode-aware text processing, not plain byte-oriented std::string operations.
You actually have two choices: find a library that processes UTF-8 directly, or convert the UTF-8 into wide characters (wchar_t). If you convert to wchar_t, you can use std::wstring and its functions to process Unicode.
The question "libxml2 xmlChar * to std::wstring" looks like a useful answer.
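To show what the conversion involves, here is a minimal sketch of a UTF-8 decoder. It is not the code from the linked answer. It decodes to std::u32string rather than std::wstring, because wchar_t is only 16 bits wide on some platforms. The name decode_utf8 is made up for this example, and the function assumes the input is well-formed UTF-8 (it only asserts against a truncated final sequence). Since xmlChar is a typedef for unsigned char, a const xmlChar * can be passed in directly.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: decode well-formed UTF-8 into one char32_t per code point.
// Assumption: the bytes are valid UTF-8; malformed input is not diagnosed.
std::u32string decode_utf8(const unsigned char* bytes, std::size_t len) {
    std::u32string out;
    std::size_t i = 0;
    while (i < len) {
        unsigned char lead = bytes[i++];
        char32_t cp;
        std::size_t extra;  // number of continuation bytes that follow
        if (lead < 0x80)      { cp = lead;        extra = 0; }  // 0xxxxxxx
        else if (lead < 0xE0) { cp = lead & 0x1F; extra = 1; }  // 110xxxxx
        else if (lead < 0xF0) { cp = lead & 0x0F; extra = 2; }  // 1110xxxx
        else                  { cp = lead & 0x07; extra = 3; }  // 11110xxx
        assert(i + extra <= len && "truncated UTF-8 sequence");
        for (std::size_t k = 0; k < extra; ++k) {
            // Each continuation byte contributes its low six bits.
            cp = (cp << 6) | (bytes[i++] & 0x3F);
        }
        out.push_back(cp);
    }
    return out;
}
```

Once the text is a sequence of code points, indexing and searching work per character instead of per byte.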
As for speed, do my eyes deceive me or are you splitting on one space and creating a new element which you then split again? This is not the way to performance. I think it would go better if you remove the text node, split all of the words out and add the new nodes as you go.
The slowdown is most likely in the repeated creation, copying and destruction of objects. Work to minimize that. For example, if Element had a constructor form that accepted a begin/end iterator pair, or a start, length pair, that would be more efficient than creating a substring (copy!) and creating an Element (copy!) and then destroying the substrings.
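As a rough illustration of the "start, length pair" idea, the sketch below splits text into words without copying any characters. Your Element class isn't shown here, so std::string_view (C++17) stands in for the (start, length) pair, and split_words is a name made up for this example.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical helper: split on spaces in one pass. Each result is a view
// (pointer + length) into the original buffer, so no characters are copied.
std::vector<std::string_view> split_words(std::string_view text) {
    std::vector<std::string_view> words;
    std::size_t pos = 0;
    while (pos < text.size()) {
        // Skip any run of spaces before the next word.
        std::size_t start = text.find_first_not_of(' ', pos);
        if (start == std::string_view::npos) break;
        // The word ends at the next space or at the end of the text.
        std::size_t end = text.find(' ', start);
        if (end == std::string_view::npos) end = text.size();
        words.push_back(text.substr(start, end - start));
        pos = end;
    }
    return words;
}
```

The views are only valid while the original buffer is alive, so build the elements from them before freeing the text. An Element constructor taking such a view (or a begin/end pair) would then make the one unavoidable copy itself.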
Repeatedly calling xmlNodeSetContent with the (probably large) second half of the text string is giving you O(n^2) performance. Not good.
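To make the cost difference concrete, here is a simplified model, not your actual code. It counts the bytes copied when the text is split at the first space and the remainder is copied again on every round, and compares that with extracting each word once in a single pass. Both function names are invented for this sketch, and it assumes every split or set-content call copies its whole argument.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Model of the repeated approach: split at the first space, copy the head
// and the whole remainder, then repeat on the remainder.
std::size_t bytes_copied_repeated(const std::string& text) {
    std::size_t copied = 0;
    std::string rest = text;  // the initial copy is not counted
    for (;;) {
        std::size_t sp = rest.find(' ');
        if (sp == std::string::npos) break;
        std::string head = rest.substr(0, sp);   // copy of one word
        std::string tail = rest.substr(sp + 1);  // copy of everything left
        copied += head.size() + tail.size();
        rest = std::move(tail);
    }
    return copied;
}

// Model of the single-pass approach: every word is copied exactly once.
std::size_t bytes_copied_single_pass(const std::string& text) {
    std::size_t copied = 0, pos = 0;
    while (pos <= text.size()) {
        std::size_t sp = text.find(' ', pos);
        if (sp == std::string::npos) sp = text.size();
        copied += sp - pos;  // length of the current word
        pos = sp + 1;
    }
    return copied;
}
```

For n one-letter words separated by single spaces, the first function counts n(n-1) bytes and the second counts n, which is the quadratic versus linear gap.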