Question

I am reading a very large XML file, which I have to read as a stream, like so:

public IEnumerable<something> GetStuff()
{
    foreach(var gzipStream in GetGZips())
    {
        using (var reader = XmlReader.Create(gzipStream, new XmlReaderSettings{ CheckCharacters = false }))
        {
            reader.MoveToContent();

            while (reader.Read()) //<-- Exception here
            {
                //Do stuff
                yield return something;
            }
        }
    }
}

I get an invalid char exception, part-way through the processing:

' ', hexadecimal value 0x19, is an invalid character. Line 655, position 45.

Given that you're not allowed to yield return inside a try-catch, what is a nice way to simply abort the processing of the current XML document (and complete the enumeration) when an error occurs?

try/finally is no good, as the exception breaks the processing of the whole IEnumerable.

I'm not able to perform any pre-processing on the files.

Solution

If you really can't do any preprocessing, and absolutely must generate the enumeration while parsing the XML, how about replacing your while loop with:

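bool isMoreXml = true;

while (isMoreXml)
{
    // Not sure what you're reading; "something" is the placeholder type from the question.
    // (var can't be used here, because the compiler can't infer a type from null.)
    something valuesRead = null;
    try
    {
        isMoreXml = reader.Read();
        if (!isMoreXml) break;
        // Do stuff
        valuesRead = whateverwereadfromxml;
    }
    catch (XmlException ex)
    {
        // Do what you gotta do (e.g. log ex), then stop reading this document
        break;
    }

    // The yield sits outside the try-catch, so the compiler allows it
    if (valuesRead != null)
        yield return valuesRead;
}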

There are other possible exceptions you should handle as well, though I'm not sure whether you were already handling those in the calling code. It's not elegant, but then I'm not sure what your limitations are (for example, no preprocessing).
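
To make the pattern above concrete, here is a minimal, self-contained sketch. The class name SafeXmlStreaming and the method ReadElementNames are made up for illustration, and yielding element names stands in for whatever "Do stuff" produces in the real code. It assumes that quietly ending the sequence on the first XmlException is acceptable.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class SafeXmlStreaming
{
    // Hypothetical example: yields the name of each element start tag and
    // stops quietly at the first XmlException instead of propagating it.
    public static IEnumerable<string> ReadElementNames(TextReader input)
    {
        using (var reader = XmlReader.Create(input))
        {
            while (true)
            {
                string name = null;
                try
                {
                    // Read() is the call that throws on malformed input,
                    // so it is the only work kept inside the try block.
                    if (!reader.Read()) break;
                    if (reader.NodeType == XmlNodeType.Element)
                        name = reader.Name;
                }
                catch (XmlException)
                {
                    // Abort this document; the caller just sees the sequence end.
                    break;
                }

                // yield return is legal here because it is outside the try-catch.
                if (name != null)
                    yield return name;
            }
        }
    }
}
```

Enumerating ReadElementNames(new StringReader(text)) over a document with an invalid character part-way through should produce the elements that precede the bad character and then simply finish. The using block is fine inside an iterator, since it compiles to try/finally rather than try/catch.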

OTHER TIPS

I was just dealing with the same thing. I know this is old, but I figured I'd put it up here for reference.

I was going to put up a gist, but I think looking at the commit on GitHub will be more helpful.

https://github.com/DewJunkie/Log2Console/commit/fb000c0a97c6762b619d213022ddc750bd9254ae

If you compare it with the prior version using WinMerge, you'll get a much clearer picture of the change.

While you can't have the yield return inside a try-catch, you can have another function that returns a single parsed instance, and put the try-catch in that second function. I used a regex to split the log into single records. I would assume that even in a large file, a single record would still fit into a buffer of a few KB. I would also imagine that the regex adds some overhead, but my main concern was losing data.

I had actually spent a few hours writing a parser, and while testing it I realized that the meat of my parser was this regex, and that I didn't even need the rest.
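
A minimal sketch of that two-function idea, under some assumptions: the names FragmentParsing, TryParseFragment and ParseAll are invented here, and XElement.Parse is swapped in for the XmlReader-based parsing in the commit, purely to keep the example short. Returning null for an unparsable fragment is one possible policy, not necessarily the one the commit uses.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;
using System.Xml.Linq;

public static class FragmentParsing
{
    // The try-catch lives in this ordinary (non-iterator) method.
    // Returns null when the fragment is not well-formed XML.
    public static XElement TryParseFragment(string xmlFragment)
    {
        try
        {
            return XElement.Parse(xmlFragment);
        }
        catch (XmlException)
        {
            // The full text of the bad fragment is available here,
            // so it could be logged or repaired and retried.
            return null;
        }
    }

    // The iterator contains no try-catch, so yield return compiles.
    public static IEnumerable<XElement> ParseAll(IEnumerable<string> fragments)
    {
        foreach (var fragment in fragments)
        {
            var parsed = TryParseFragment(fragment);
            if (parsed != null)
                yield return parsed;
        }
    }
}
```

With this split, one bad record costs only that record: the loop moves on to the next fragment instead of ending the whole enumeration.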

TL;DR:

// Old method, very similar to what you had

while (!xmlReader.EOF) { xmlReader.Read(); }

// New method

IEnumerable<Foo> ParseFile(string xmlText)
{
    foreach (Match match in Regex.Matches(xmlText, $"<(/?)\\s*(XML_RECORD_ELEMENT)[^<>]*(/?)>"))
    {
        /* Logic to split the XML into fragments based on the matches.
           Working code is in the above commit. Not too long, but too long for MD. */
        yield return ParseXmlFragment(xmlFragment);
        ...
    }
}

Foo ParseXmlFragment(string xmlFragment)
{
    Foo newFoo = new Foo();
    try
    {
        // XmlReader here to parse the fragment into newFoo
    }
    catch (Exception ex)
    {
        // Handle ex here if possible. If not, you now have the complete text of the
        // unparsable fragment, which you can correct and try again.
        throw; // if you want to halt execution; remove this line to continue instead
    }
    return newFoo;
}
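
The splitting logic itself is elided above, so the following is only a sketch of one way it could work, not the code from the commit. RecordSplitter and SplitRecords are hypothetical names. It assumes the record element name is known up front and that the characters < and > never appear inside attribute values. It also detects self-closing tags by checking for a trailing "/>" rather than through a capture group.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RecordSplitter
{
    // Yields the raw text of each top-level <elementName>...</elementName>
    // (or self-closing) record found in xmlText. No XML parsing happens here,
    // so invalid characters inside a record do not cause an exception.
    public static IEnumerable<string> SplitRecords(string xmlText, string elementName)
    {
        // Matches opening, closing and self-closing tags of the record element.
        var tagPattern = new Regex(
            @"<(/?)\s*" + Regex.Escape(elementName) + @"(?=[\s/>])[^<>]*>");

        int depth = 0;   // nesting level of record elements
        int start = -1;  // index where the current top-level record begins

        foreach (Match match in tagPattern.Matches(xmlText))
        {
            bool isClosing = match.Groups[1].Value == "/";
            bool isSelfClosing = !isClosing
                && match.Value.EndsWith("/>", StringComparison.Ordinal);

            if (isSelfClosing)
            {
                if (depth == 0) yield return match.Value;
            }
            else if (!isClosing)
            {
                if (depth == 0) start = match.Index;
                depth++;
            }
            else if (depth > 0)
            {
                depth--;
                if (depth == 0)
                    yield return xmlText.Substring(start, match.Index + match.Length - start);
            }
        }
    }
}
```

Each string this produces can then be handed to a per-fragment parser such as ParseXmlFragment, which is where the try-catch belongs. Text between records is skipped, and a record that is opened but never closed is dropped without any error.
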
Licensed under: CC-BY-SA with attribution