Question

I have an application that needs to download a large number (>10k) of large XML files (8-10 MB each) over HTTP and extract some content from each one using a single XPath expression.

I'm wondering how this process can be optimized. These XML files will go directly into the Large Object Heap. I'm thinking about three options:

- Overall optimization: download the XML files using a separate IO thread pool.
- Use streams to read the web response instead of reading it into a string that goes to the LOH (not sure if that's possible, or how to do it).
- Use a Regex to retrieve the content from the XML, since my XPath is pretty simple and I don't need full DOM support for it.

Are there any other options?


Solution

There are lots of options for optimization, depending on what you want to maximize.

If your processing is faster than download (and it's hard to imagine that your XPath-based search will be slow), your limiting factor will be download speed. You can use asynchronous requests to download multiple files at a time, but if all the files are coming from the same server it's unlikely that more than a handful of concurrent downloads will give you any performance increase.
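As a rough illustration, here's a minimal sketch of that approach using HttpClient with a SemaphoreSlim to cap concurrency. The limit of four, the urls sequence, and the process callback are all placeholders to adapt, not part of the original answer:

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

class Downloader
{
    static readonly HttpClient Client = new HttpClient();

    public static async Task DownloadAllAsync(IEnumerable<string> urls, Func<string, Task> process)
    {
        var throttle = new SemaphoreSlim(4);   // only a handful of concurrent requests
        var tasks = new List<Task>();

        foreach (var url in urls)
        {
            await throttle.WaitAsync();        // wait for a free download slot
            tasks.Add(Task.Run(async () =>
            {
                try
                {
                    string xml = await Client.GetStringAsync(url);
                    await process(xml);        // hand the file off for the XPath search
                }
                finally
                {
                    throttle.Release();
                }
            }));
        }

        await Task.WhenAll(tasks);
    }
}
```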

You could create an XmlReader from the stream while you're downloading, and (I think, although I'm not sure) run your XPath expression against it as the data arrives. But that doesn't really give you much benefit: the document still has to be parsed fully into memory before the query runs, so at best you skip the intermediate string.
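One way to sketch that idea in .NET is with XPathDocument, which accepts a stream directly. This is my reading of the suggestion, not a guaranteed win; the url and xpath parameters are hypothetical:

```csharp
using System.Net.Http;
using System.Threading.Tasks;
using System.Xml.XPath;

class StreamSearch
{
    static readonly HttpClient Client = new HttpClient();

    public static async Task<string> FindFirstAsync(string url, string xpath)
    {
        using (var stream = await Client.GetStreamAsync(url))
        {
            // XPathDocument parses straight from the response stream -- no
            // intermediate string -- but the whole document is still built
            // in memory before the query runs.
            var doc = new XPathDocument(stream);
            var node = doc.CreateNavigator().SelectSingleNode(xpath);
            return node == null ? null : node.Value;
        }
    }
}
```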

I think you're unnecessarily worried about the large object heap. If you're downloading and processing one file at a time, each string will go into the LOH, get processed, and then be collected. Yes, there's the potential of fragmenting your large object heap, but if the files are all in the 8 to 10 MB range, it's highly unlikely in practice that you will have a problem. There would have to be a pathological arrangement of files.

And you don't really have to download to a string. You can pre-allocate a buffer of, say, 20 MB, and download to that buffer. Then wrap a MemoryStream around it, create an XmlReader on that memory stream, etc. Your LOH won't get fragmented at all, because you just re-use that 20 MB buffer. I really wouldn't go this route unless I absolutely had to, though.
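A sketch of that buffer-reuse idea, assuming files are handled one at a time (the buffer is shared, so this is not thread-safe) and that no file ever exceeds 20 MB:

```csharp
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;
using System.Xml.XPath;

class BufferedDownloader
{
    static readonly HttpClient Client = new HttpClient();

    // Allocated once and reused for every file, so the LOH never fragments.
    // Hard assumption: no file exceeds 20 MB, or it will be silently truncated.
    static readonly byte[] Buffer = new byte[20 * 1024 * 1024];

    public static async Task<XPathDocument> DownloadAsync(string url)
    {
        using (var stream = await Client.GetStreamAsync(url))
        {
            int total = 0, read;
            while ((read = await stream.ReadAsync(Buffer, total, Buffer.Length - total)) > 0)
                total += read;

            // Wrap only the bytes actually received, without copying them.
            return new XPathDocument(new MemoryStream(Buffer, 0, total, writable: false));
        }
    }
}
```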

Were I assigned this task, I'd do it in the simplest way possible. The limiting factor is going to be the download speed, so that's where I'd concentrate any optimization efforts. I wouldn't worry at all about potential LOH fragmentation, but I'd keep the alternate solution in my back pocket just in case it crops up as a problem.

How you approach this really depends on how fast that XPath search is. If it takes milliseconds or even a few seconds to search a 10 MB XML file, then it makes no sense at all to worry about optimizing the search: the download time is going to dwarf the search time. Instead, I'd see if I could get two or four concurrent downloads, throw each string result into a BlockingCollection when it comes in, and have a consumer thread reading that queue and running the search. That consumer thread will probably spend a lot of its time idle, waiting for the next file to come down.
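A minimal sketch of that pipeline, with a hypothetical search callback standing in for the XPath work:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Net.Http;
using System.Threading.Tasks;

class Pipeline
{
    static readonly HttpClient Client = new HttpClient();

    public static void Run(IEnumerable<string> urls, Action<string> search)
    {
        // Bounded so the downloads can't race ahead and pile up 10 MB strings.
        var queue = new BlockingCollection<string>(boundedCapacity: 8);

        // Consumer thread: runs the XPath search over each file as it arrives.
        var consumer = Task.Run(() =>
        {
            foreach (var xml in queue.GetConsumingEnumerable())
                search(xml);
        });

        // Producers: a few concurrent downloads feeding the queue.
        Parallel.ForEach(
            urls,
            new ParallelOptions { MaxDegreeOfParallelism = 4 },
            url => queue.Add(Client.GetStringAsync(url).GetAwaiter().GetResult()));

        queue.CompleteAdding();   // tell the consumer no more files are coming
        consumer.Wait();
    }
}
```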

In short: make it work, then make it work fast.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow