Question

I am processing XML files from a third party. These files occasionally contain invalid characters, which cause XmlTextReader.Read() to throw an exception.

I am currently handling this with the following function:

XmlTextReader GetCharSafeXMLTextReader(string fileName)
{
    try
    {
        MemoryStream ms = new MemoryStream();
        StreamReader sr = new StreamReader(fileName);
        StreamWriter sw = new StreamWriter(ms);
        string temp;
        while ((temp = sr.ReadLine()) != null)
            sw.WriteLine(temp.Replace(((char)4).ToString(), "").Replace(((char)0x14).ToString(), ""));

        sw.Flush();
        sr.Close();
        ms.Seek(0, SeekOrigin.Begin);
        return new XmlTextReader(ms);
    }
    catch (Exception exp)
    {
        // pass exp itself as the inner exception so the original stack trace is kept
        throw new Exception("Error parsing file: " + fileName + " " + exp.Message, exp);
    }
}

My gut is saying there should be a better/faster way to do this. (And yes, getting the third party to fix their XMLs would be great, but it's not happening at this point.)

EDIT: Here is the final solution, based on cfeduke's answer:


    public class SanitizedStreamReader : StreamReader
    {
        public SanitizedStreamReader(string filename) : base(filename) { }
        /* other ctors as needed */
        // this is the only one that XmlTextReader appears to use but
        // it is unclear from the documentation which methods call each other
        // so best bet is to override all of the Read* methods and Peek
        public override string ReadLine()
        {
            return Sanitize(base.ReadLine());
        }

        public override int Read()
        {
            int temp = base.Read();
            while (temp == 0x4 || temp == 0x14)
                temp = base.Read();
            return temp;
        }

        public override int Peek()
        {
            int temp = base.Peek();
            while (temp == 0x4 || temp == 0x14)
            {
                temp = base.Read();
                temp = base.Peek();
            }
            return temp;
        }

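        public override int Read(char[] buffer, int index, int count)
        {
            int kept = 0;
            int read;
            // keep reading until at least one valid character survives or the
            // stream ends; returning 0 too early would look like end of stream
            while (kept == 0 && (read = base.Read(buffer, index, count)) > 0)
            {
                // only scan the characters just read, compacting the valid ones
                for (int x = index; x < index + read; x++)
                {
                    if (buffer[x] != 0x4 && buffer[x] != 0x14)
                        buffer[index + kept++] = buffer[x];
                }
            }
            return kept; // number of characters left after removal
        }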

        private static string Sanitize(string unclean)
        {
            if (unclean == null)
                return null;
            if (String.IsNullOrEmpty(unclean))
                return "";
            return unclean.Replace(((char)4).ToString(), "").Replace(((char)0x14).ToString(), "");
        }
    }

Solution

Sanitizing data is important, and edge cases such as invalid characters in "XML" do occur. Your solution is correct. If you want one that fits the .NET streaming model, restructure the code into its own StreamReader subclass:

public class SanitizedStreamReader : StreamReader {
  public SanitizedStreamReader(string filename) : base(filename) { }
  /* other ctors as needed */

  // it is unclear from the documentation which methods call each other
  // so best bet is to override all of the Read* methods and Peek
  public override string ReadLine() {
    return Sanitize(base.ReadLine());
  }

  // TODO override Read*, Peek with a similar logic as this.ReadLine()
  // remember Read(Char[], Int32, Int32) to modify the return value by
  // the number of removed characters

  private static string Sanitize(string unclean) {
    // pass null through unchanged: ReadLine() returns null at end of stream
    if (String.IsNullOrEmpty(unclean))
      return unclean;
    return unclean.Replace(((char)4).ToString(), "").Replace(((char)0x14).ToString(), "");
  }
}

With this new SanitizedStreamReader you'll be able to chain it into processing streams as necessary, rather than relying on a magic method to clean things and present you with an XmlTextReader:

return new XmlTextReader(new SanitizedStreamReader("filename.xml"));

Admittedly this may be more work than necessary but you will gain flexibility from this approach.
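
For what it's worth, a variation on the same idea is to wrap any TextReader instead of subclassing StreamReader. The sketch below is not from the answer above; FilteringTextReader and Demo are hypothetical names. It relies on the documented fact that a TextReader subclass only has to implement Read() and Peek(), since the base class's other Read* methods are built on those two, which sidesteps the question of which methods call each other. It assumes the same two characters (0x04 and 0x14) are the only ones to drop.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical wrapper: forwards to another TextReader and silently
// drops the characters 0x04 and 0x14.
public sealed class FilteringTextReader : TextReader
{
    private readonly TextReader inner;

    public FilteringTextReader(TextReader inner)
    {
        if (inner == null)
            throw new ArgumentNullException(nameof(inner));
        this.inner = inner;
    }

    private static bool IsUnwanted(int c)
    {
        return c == 0x04 || c == 0x14;
    }

    // Returns the next wanted character, or -1 at end of input.
    public override int Read()
    {
        int c;
        do
        {
            c = inner.Read();
        } while (IsUnwanted(c));
        return c;
    }

    // Consumes unwanted characters so the peeked value matches the next Read().
    public override int Peek()
    {
        int c;
        while (IsUnwanted(c = inner.Peek()))
            inner.Read();
        return c;
    }

    protected override void Dispose(bool disposing)
    {
        if (disposing)
            inner.Dispose();
        base.Dispose(disposing);
    }
}

public static class Demo
{
    // Parses rawXml through the filter and returns the text of the first <title>.
    public static string ReadTitle(string rawXml)
    {
        using (var text = new FilteringTextReader(new StringReader(rawXml)))
        using (var reader = XmlReader.Create(text))
        {
            reader.ReadToFollowing("title");
            return reader.ReadElementContentAsString();
        }
    }

    public static void Main()
    {
        // The input contains 0x04 and 0x14, which would normally make the parser throw.
        Console.WriteLine(ReadTitle("<doc><title>Hel\u0004lo\u0014</title></doc>"));
    }
}
```

Reading one character at a time is slower than filtering a buffer in bulk, so for large files the Read(char[], int, int) override approach is probably still the better fit.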

OTHER TIPS

XML concerns aside, if the file is not large enough to warrant processing sequentially, I would simplify the code to something along these lines:

var xml = File.ReadAllText(pathName);
var fixedXml = xml.Replace(((char)4).ToString(), "").Replace(((char)0x14).ToString(), "");
File.WriteAllText(pathName, fixedXml);
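
Note that the snippet above overwrites the original file. If that is not wanted, the same whole-file approach can stay in memory. This is only a sketch: InMemorySanitizer, StripBadChars and LoadSanitized are hypothetical names, and it assumes File.ReadAllText picks the right encoding for the file (it detects a byte order mark and otherwise falls back to UTF-8).

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class InMemorySanitizer
{
    // Hypothetical helper: copies every character except 0x04 and 0x14.
    public static string StripBadChars(string text)
    {
        var sb = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            if (c != '\u0004' && c != '\u0014')
                sb.Append(c);
        }
        return sb.ToString();
    }

    // Reads the whole file, strips the bad characters, and parses the result
    // without touching the file on disk.
    public static XmlDocument LoadSanitized(string pathName)
    {
        var doc = new XmlDocument();
        doc.LoadXml(StripBadChars(File.ReadAllText(pathName)));
        return doc;
    }

    public static void Main()
    {
        // Write a small temporary file containing both offending characters.
        string path = Path.GetTempFileName();
        File.WriteAllText(path, "<doc>a\u0004b\u0014c</doc>");
        try
        {
            Console.WriteLine(LoadSanitized(path).DocumentElement.InnerText);
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

A single pass with a StringBuilder also avoids the intermediate string that each chained Replace call allocates.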
Licensed under: CC-BY-SA with attribution