Question

I am currently converting our old parsers from XmlDocument to XDocument. I do this mainly to get LINQ querying and the added line-number information.

My XML contains an element like this:

<?xml version="1.0"?>
<fulltext>
    hello this is a failed textnode
    &#xC;
    and I don't know how to parse it.
</fulltext>

XmlDocument seems to have no problem reading that node with the following code:

var xmlDocument = new XmlDocument();

var physicalPath = GetPhysicalPath(uploadFolderFile);
try
{
    xmlDocument.Load(physicalPath);
}
catch (XmlException xmlException)
{
    _log.Warn("Problems with the document", xmlException);
}

The example above parses the document fine, but when I try the same thing with XDocument:

XDocument xmlDocument;
var physicalPath = GetPhysicalPath(uploadFolderFile);
var xmlStream = new System.IO.StreamReader(physicalPath);
try
{
   xmlDocument = XDocument.Load(xmlStream, LoadOptions.SetLineInfo | LoadOptions.SetBaseUri);
}
catch (XmlException xmlException)
{
   _log.Warn("Trying to clean document for HexaDecimal", xmlException);
}

It fails to read the document because of the character reference &#xC;. That character seems to be allowed in XML version 1.1, but changing the version in the XML declaration doesn't help. I have thought about parsing the document with XmlDocument and then converting it, but that seems counterintuitive. Can anybody help with this problem?


Solution

OK, so I sort of found a solution to this problem.

First, I try to parse the XML using the following code:

private XDocument GetXmlDocument(String physicalPath)
{
    XDocument xmlDocument;
    try
    {
        // Dispose the reader so the file handle is released even when parsing fails.
        using (var xmlStream = new System.IO.StreamReader(physicalPath))
        {
            xmlDocument = XDocument.Load(xmlStream, LoadOptions.SetLineInfo);
        }
    }
    catch (XmlException)
    {
        //_log.Warn("Trying to clean document for HexaDecimal", xmlException);
        xmlDocument = XmlSanitizingStream.TryToCleanXMLBeforeParsing(physicalPath);
    }

    return xmlDocument;
}

If the document fails to load, I try to clean it using the technique from this blog post: http://seattlesoftware.wordpress.com/2008/09/11/hexadecimal-value-0-is-an-invalid-character/

This does not remove the character reference I mentioned before, but it does remove any character not allowed by the XML standard.
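
To illustrate the kind of filtering described above, here is a minimal sketch. It is not the blog post's XmlSanitizingStream; the class and method names (XmlCharFilter, IsLegalXml10Char, StripIllegalChars) are made up for this example, and it assumes the filter works on raw characters using the Char ranges from the XML 1.0 specification. It also shows why the text &#xC; survives: the reference itself is spelled with legal characters.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration only; not part of the blog post's code.
public static class XmlCharFilter
{
    // True when the UTF-16 code unit falls in the XML 1.0 Char ranges.
    // Surrogate halves are let through; correct pairing is not checked in this sketch.
    public static bool IsLegalXml10Char(char c)
    {
        return c == 0x9 || c == 0xA || c == 0xD
            || (c >= 0x20 && c <= 0xD7FF)
            || (c >= 0xE000 && c <= 0xFFFD)
            || char.IsSurrogate(c);
    }

    // Returns a copy of the text with every illegal raw character dropped.
    public static string StripIllegalChars(string text)
    {
        var sb = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            if (IsLegalXml10Char(c)) sb.Append(c);
        }
        return sb.ToString();
    }
}
```

A raw form feed in the input is dropped by StripIllegalChars, while the six characters of the reference &#xC; pass through untouched, so the parser still has to deal with the reference afterwards.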

Then, after sanitizing the XML, I create an XmlReader and set its settings to not check characters:

public static XDocument TryToCleanXMLBeforeParsing(String physicalPath)
{
    string xml;
    Encoding encoding;
    // Read the whole file through the sanitizing reader, which drops illegal characters.
    using (var reader = new XmlSanitizingStream(File.OpenRead(physicalPath)))
    {
        xml = reader.ReadToEnd();
        encoding = reader.CurrentEncoding;
    }

    // Re-encode the cleaned text with the encoding detected while reading.
    byte[] encodedString;
    if (encoding.Equals(Encoding.UTF8)) encodedString = Encoding.UTF8.GetBytes(xml);
    else if (encoding.Equals(Encoding.UTF32)) encodedString = Encoding.UTF32.GetBytes(xml);
    else encodedString = Encoding.Unicode.GetBytes(xml);

    // CheckCharacters = false stops the reader from rejecting character references such as &#xC;.
    var settings = new XmlReaderSettings {CheckCharacters = false};
    using (var ms = new MemoryStream(encodedString))
    using (XmlReader xmlReader = XmlReader.Create(ms, settings))
    {
        return XDocument.Load(xmlReader);
    }
}

Since I clean the document of illegal characters before telling the reader to skip character checking, I am fairly sure that I am not reading a malformed XML document. In the worst case the XML is still malformed and loading throws an error anyway.
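
To see the effect of the CheckCharacters setting in isolation, here is a small sketch that works on an in-memory string instead of a file. The class and method names (LenientXmlDemo, LoadsWithDefaults, LoadLenient) are invented for this example, and passing LoadOptions.SetLineInfo is my addition, on the assumption that line numbers are still wanted on this path.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Linq;

// Hypothetical demo class; names are illustrative and not from the original code.
public static class LenientXmlDemo
{
    // Returns false when the default, strict parser rejects the text.
    public static bool LoadsWithDefaults(string xml)
    {
        try
        {
            XDocument.Parse(xml);
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }

    // Loads the text with character checking switched off, so a character
    // reference such as &#xC; is accepted instead of raising XmlException.
    public static XDocument LoadLenient(string xml)
    {
        var settings = new XmlReaderSettings { CheckCharacters = false };
        using (var stringReader = new StringReader(xml))
        using (var xmlReader = XmlReader.Create(stringReader, settings))
        {
            return XDocument.Load(xmlReader, LoadOptions.SetLineInfo);
        }
    }
}
```

With the sample element from the question, LoadsWithDefaults should report a failure, while LoadLenient returns a document whose text contains the form feed character. Line numbers can then be read by casting an element to IXmlLineInfo.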

I only use this for parsing, and it should only be used to read the data. It does not make the XML well-formed, and in many cases it will cause exceptions elsewhere in your code. I use it only because I cannot change what the customer sends us and I have to read it as is.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow