Question

So far, what I'm doing is:

try
{
    // Load throws XmlException if the file is not well-formed XML
    XmlDocument xmldoc = new XmlDocument();
    xmldoc.Load(orderFilePath);
}
catch (XmlException exception)
{
    //... blah blah - there was an error, let the user know
}

But I would really like to be able to attempt to parse the file anyway. When I say "malformed" I don't necessarily mean that there will be an unclosed tag or element, but that there might be something like one of the following included in an element's value: '<', '>', '&'

I've seen it mentioned that I would probably have to use XmlReader, but would that still throw an exception on that element, or would it allow me to fix the problem in some way?

I know fixing the XML at the source is the best solution, but I do not control where the XML is coming from.

Thanks!

EDIT:

Super simple example of the XML:

<Order>
  <Customer_ID>555-555-5555</Customer_ID>
  <ShipToAddress>
    <Customer_Name>Some Guy</Customer_Name>
    <Street>123 Fake Dr.</Street>
    <Street2></Street2>
    <City>West Palm Beach</City>
    <State>FL</State>
    <ZipCode>33417</ZipCode>
    <Country>United States</Country>
  </ShipToAddress>
  <BillToAddress>
    <Customer_Name>Some Guy</Customer_Name>
    <Street>123 Fake Dr.</Street>
    <Street2></Street2>
    <City>West Palm Beach</City>
    <State>FL</State>
    <ZipCode>33417</ZipCode>
    <Country>United States</Country>
  </BillToAddress>
  <items>
    <item>
      <Product_ID>25101</Product_ID>
      <Product_Name></Product_Name>
      <Quantity>1</Quantity>
      <USPrice>26.95000</USPrice>
    </item>
  </items>
<!-- bad stuff here -->
<How_did_you_hear_about_us>Coffee & Tea magazine</How_did_you_hear_about_us>
<!-- bad stuff here -->
</Order>

The thing is - I don't necessarily know if it will always be in the same place.


Solution

One approach is to run a few checks before parsing. You could use a regex to validate the XML tags, but it may be easier to push every < and > symbol onto a Stack. Afterwards, loop through it and assert that you never get the same symbol twice in a row.

This raises the question: how do you distinguish between <MyElement>> and <MyEl>ement>?
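The bracket check described above can be sketched roughly as follows. This is a minimal illustration, not a full validator: `BracketCheck` and `HasAlternatingBrackets` are hypothetical names, and the sketch remembers only the last bracket seen instead of keeping a full Stack, since that is all the comparison needs. It assumes the document has no CDATA sections or attribute values containing '<' or '>', which would trigger false alarms. It can only report that something is wrong; it cannot tell which of the two brackets is the stray one.

```csharp
using System;

public static class BracketCheck
{
    // Returns true when '<' and '>' strictly alternate, starting with '<'
    // and ending with '>'. Hypothetical helper, not part of any library.
    public static bool HasAlternatingBrackets(string xml)
    {
        char last = '>'; // behave as if a tag had just been closed
        foreach (char c in xml)
        {
            if (c != '<' && c != '>') continue; // ignore everything else
            if (c == last) return false;        // same symbol twice in a row
            last = c;
        }
        return last == '>'; // a trailing '<' was never closed
    }
}
```

Note that this check says nothing about a stray '&', so the "Coffee & Tea magazine" sample from the question would still pass it and then fail in the parser.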

This is all pretty vague though: what do you want to happen when the XML turns out to be invalid? How far do you want to take this pre-processing validation?

I believe that the best option here is to not proceed. You can't fix every issue in the malformed XML thrown at you, and it might just be better to inform the user and stop there.

If the source is consistently sending malformed XML at you, you'll have to contact the maintainers or look for alternatives.

OTHER TIPS

As others have mentioned - there are a couple of things to do here:

Step 1: Find out whether the XML is malformed or not, for both elements and values (or attributes). Solution: Use Regex, or load the text through a StringBuilder and scan for the offending characters (Regex is usually the better choice)

Step 2: You can also write an XSD if you want to validate that certain required elements are always present (the bare minimum). If they are missing, you can throw an error, depending on your workflow

Step 3: Once you have parsed/fixed the XML, you then need to consume the values. Solution: LINQ to XML is a good approach here for pulling out the values you are interested in that are not malformed
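Steps 1 and 3 could look something like the sketch below for the specific case in the question, a bare '&' inside an element value. `LenientXml`, `EscapeStrayAmpersands` and `ParseLeniently` are hypothetical names introduced for illustration. The regex escapes any '&' that does not already look like an entity or character reference, and the result is handed to LINQ to XML. It assumes there are no CDATA sections (an '&' inside one would be escaped by mistake), and it does nothing about a stray '<' in a value, which is much harder to repair reliably.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class LenientXml
{
    // Matches an '&' that does not begin something shaped like an entity
    // (&name;) or a character reference (&#123; or &#x1F;).
    private static readonly Regex StrayAmpersand =
        new Regex(@"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)");

    // Step 1: repair the raw text before any XML parser sees it.
    public static string EscapeStrayAmpersands(string rawXml)
    {
        return StrayAmpersand.Replace(rawXml, "&amp;");
    }

    // Step 3: parse the repaired text with LINQ to XML.
    // Still throws XmlException if other problems remain.
    public static XDocument ParseLeniently(string rawXml)
    {
        return XDocument.Parse(EscapeStrayAmpersands(rawXml));
    }
}
```

With the sample order from the question, calling `LenientXml.ParseLeniently(rawXml)` and then reading `(string)doc.Root.Element("How_did_you_hear_about_us")` should give back the text "Coffee & Tea magazine". Anything shaped like an entity but not defined in XML (for example `&nbsp;`) is left alone and will still make the parser throw, so the try/catch around the parse is still needed.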

Licensed under: CC-BY-SA with attribution