Parse multiple XML elements from a string when some of them may be incomplete

https://stackoverflow.com/questions/22644207

21-06-2023

Question

I'm looking for a convenient way to parse a string that contains multiple XML elements, some of which may be incomplete. To understand why I'm doing this, assume the data is arriving over a network connection and may be incomplete at any given time, so it may look something like this:

Here is a single element:

"<note id='104'> <stuff> WEEE!</stuff> </note>"

Here is what I'm receiving:

myString = "<note id='104'> <stuff> WEEE!</stuff>   </note> <note id"

Notice that the second element (identical to the first) is cut off. I want to parse this in an easy, automatic way so that I can read all the complete elements and ignore all the incomplete ones. The obvious way is to look for <note and the matching </note>, but that doesn't always work because the element could be written in another form, and I can't just look for /> because that could end up matching something else. This is where it gets complicated, which is why I'm looking for something automatic to do this for me.

I'd love to be able to call some function that says

List<XmlDocument> xmlList = ExtractCompleteXMLDocsFromThisString(myString);

In the case above, with myString as input, it would return one XmlDocument and not the second, incomplete one.

Update: The code below works, but it is not an efficient way of doing it.

        // Brute force: try every substring and keep the ones that load as XML.
        for ( int j = 0; j < myString.Length; j++ )
        {
            for ( int i = j + 1; i <= myString.Length; i++ )
            {
                // Substring takes (startIndex, length), not (startIndex, endIndex).
                string subString = myString.Substring(j, i - j);

                try
                {
                    XmlDocument subStringXML = new XmlDocument();
                    subStringXML.LoadXml(subString);
                    Console.WriteLine("Found a good one!");
                    Console.WriteLine(subString);
                    // Extract the good one, then rescan from the same start index.
                    myString = myString.Remove(j, i - j);
                    i = j;
                }
                catch(XmlException)
                {
                    // Not a complete document yet; try a longer substring.
                }
            }
        }

Update 2: I tried a Split-based technique but found a better method; see Update 3.

Update 3: XmlReader can handle truncated documents! See the code below. The only addition needed is exception handling: the reader parses the XML while it is valid, then throws an exception when it reaches the part that is not. This works perfectly for what I'm doing. Thanks.

        XmlReaderSettings settings = new XmlReaderSettings();
        settings.ConformanceLevel = ConformanceLevel.Fragment;
        using (XmlReader reader = XmlReader.Create(new StringReader(myString), settings))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element)
                {
                    Console.WriteLine("reader.Name = " + reader.Name);
                }
            }
        }
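
A minimal sketch of that exception handling, shaped like the function asked for earlier in the question. The names XmlFragmentExtractor and ExtractCompleteXmlDocs are hypothetical, and using ReadSubtree to load each top-level element is one possible approach, not necessarily the one the code above ended up with:

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class XmlFragmentExtractor
{
    // Hypothetical helper: returns one XmlDocument per complete top-level
    // element and stops quietly at the first truncated or malformed one.
    public static List<XmlDocument> ExtractCompleteXmlDocs(string input)
    {
        var docs = new List<XmlDocument>();
        var settings = new XmlReaderSettings
        {
            // Fragment mode allows more than one top-level element.
            ConformanceLevel = ConformanceLevel.Fragment
        };

        using (XmlReader reader = XmlReader.Create(new StringReader(input), settings))
        {
            try
            {
                while (reader.Read())
                {
                    if (reader.NodeType != XmlNodeType.Element)
                    {
                        continue; // skip whitespace, comments, etc. between elements
                    }

                    // The subtree reader covers only the current element, so a
                    // complete element loads even if the text after it is broken.
                    using (XmlReader subtree = reader.ReadSubtree())
                    {
                        var doc = new XmlDocument();
                        doc.Load(subtree); // throws XmlException if the element is cut off
                        docs.Add(doc);
                    }
                }
            }
            catch (XmlException)
            {
                // Input ended or became invalid; keep what was parsed so far.
            }
        }

        return docs;
    }
}
```

Calling XmlFragmentExtractor.ExtractCompleteXmlDocs(myString) with the myString shown earlier should return a list holding just the first note element, since the second one is cut off before its start tag closes.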

Solution

You made no mention of what you have already tried in your implementation. I haven't tested this with your example, but XmlReader is a forward-only streaming parser, somewhat like a SAX parser, and so should be able to handle a truncated document like this. What I have in mind is something like:

using (XmlReader reader = XmlReader.Create(new StringReader("...")))
{
   while (reader.Read())
   {
      if (reader.NodeType == XmlNodeType.Element)
      {
         ...
      }
   }
}
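
One caveat with the snippet above: XmlReader.Create without settings uses ConformanceLevel.Document, which permits only a single root element, so a string holding several complete elements would fail at the second one. The rough sketch below compares the two levels; ConformanceDemo and CountTopLevelElements are hypothetical names used only for illustration:

```csharp
using System.IO;
using System.Xml;

public static class ConformanceDemo
{
    // Hypothetical helper: counts the top-level elements read before the
    // reader reaches the end of the input or throws an XmlException.
    public static int CountTopLevelElements(string input, ConformanceLevel level)
    {
        var settings = new XmlReaderSettings { ConformanceLevel = level };
        int count = 0;

        using (XmlReader reader = XmlReader.Create(new StringReader(input), settings))
        {
            try
            {
                while (reader.Read())
                {
                    if (reader.NodeType == XmlNodeType.Element && reader.Depth == 0)
                    {
                        count++;
                    }
                }
            }
            catch (XmlException)
            {
                // Stop where the input is no longer valid for this level.
            }
        }

        return count;
    }
}
```

For the input "<a/><b/>", this should count one element with ConformanceLevel.Document (the second root is rejected) and two with ConformanceLevel.Fragment.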

Licensed under: CC-BY-SA with attribution
