Question

I have a large XML file (3 MB+) and I have an XSD to validate it against.

I am using Python and lxml. I started from this script, which does the validation fine, including giving me the line number. The problem is that the whole file is on one line, so when I validate, every error is reported on line 1. When I use pretty-printing to split the file into lines, the reported line number maxes out at 65535.

Thanks!

Solution

Pretty-print your XML to add newlines to it. Then put it through your validator to get a more helpful line number.
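A minimal sketch of that idea with lxml, assuming the document fits comfortably in memory. The function name and the tiny sample schema are made up for illustration, and the reported line numbers refer to the pretty-printed text, not to the original one-line file.

```python
from lxml import etree


def validate_with_line_numbers(xml_bytes, xsd_bytes):
    """Return (line, message) pairs for schema errors, with lines counted
    in the pretty-printed form of the document."""
    schema = etree.XMLSchema(etree.fromstring(xsd_bytes))
    # Re-serialize with indentation so each element gets its own line.
    pretty = etree.tostring(etree.fromstring(xml_bytes), pretty_print=True)
    # Reparse the pretty-printed bytes so node line numbers match that text.
    doc = etree.fromstring(pretty)
    schema.validate(doc)
    return [(err.line, err.message) for err in schema.error_log]


# Hypothetical sample data: a root element holding integer <n> children.
XSD = b"""<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="root">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="n" type="xs:integer" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""
XML = b"<root><n>1</n><n>oops</n></root>"  # everything on one line

for line, message in validate_with_line_numbers(XML, XSD):
    print(f"line {line}: {message}")
```

Pretty-printing only inserts line breaks where the document has no existing whitespace between elements, which is normally the case for a file stored on a single line.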

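EDIT: On re-reading your question, I see that you have already used Notepad++ to add the newlines, but that lxml apparently has a limitation when it comes to validating your XML. (65535 is the largest value an unsigned 16-bit integer can hold, which suggests the ceiling lies in how line numbers are stored rather than in the size of the file itself.)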

For a general approach to this problem, please see Validating a HUGE XML file. In particular, the accepted answer begins with:

Instead of using a DOMParser, use a SAXParser. This reads from an input stream or reader so you can keep the XML on disk instead of loading it all into memory.

Basically, you need a streaming approach, which is what SAX offers. So, if you must validate your file in Python, you'll need to find a validation approach based on streaming. (Perhaps lxml offers validation in a streaming fashion?)
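On that parenthetical question: lxml's `iterparse` accepts a `schema` keyword, so one possible sketch of streaming validation looks like the following. The function name and sample schema are hypothetical, and the exact point at which the error surfaces (mid-stream or at the end of the input) is not something this sketch relies on.

```python
import io

from lxml import etree


def stream_validate(source, schema):
    """Validate source (a path or binary file object) while parsing it.

    Returns None if the document is valid, otherwise the error text.
    """
    try:
        for _event, elem in etree.iterparse(source, events=("end",), schema=schema):
            # Drop each finished element's content so the tree stays small.
            elem.clear()
    except etree.LxmlError as exc:
        # Base class of lxml's parse and validation exceptions.
        return str(exc)
    return None


# Hypothetical sample schema: a root element holding integer <n> children.
SCHEMA = etree.XMLSchema(etree.fromstring(
    b"""<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="root">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="n" type="xs:integer" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""
))

problem = stream_validate(io.BytesIO(b"<root><n>1</n><n>oops</n></root>"), SCHEMA)
print(problem or "valid")
```

For a real file, pass its path instead of the `io.BytesIO` object. This addresses memory use; it does not by itself change how line numbers are reported.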

However, if your validation requirements are more flexible, then consider a specialized tool such as XMLStarlet.

For example, here's how to validate an XML file against an XSD, taken from the XMLStarlet entry on Wikipedia:

xmlstarlet val -e -s my.xsd my.xml
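If the rest of your workflow is in Python, the same command can be driven from a script. This is only a sketch: the helper names are hypothetical, it assumes the executable is on your PATH under the name `xmlstarlet` (some installations name it `xml`), and it assumes an exit status of 0 means the file validated.

```python
import subprocess


def xmlstarlet_command(xsd_path, xml_path, executable="xmlstarlet"):
    """Build the argument list for the validation command shown above."""
    return [executable, "val", "-e", "-s", xsd_path, xml_path]


def validate_with_xmlstarlet(xsd_path, xml_path, executable="xmlstarlet"):
    """Run XMLStarlet and return (ok, output).

    Assumption: exit status 0 means the file validated.
    """
    result = subprocess.run(
        xmlstarlet_command(xsd_path, xml_path, executable),
        capture_output=True,
        text=True,
        check=False,  # a failed validation is an expected outcome, not a crash
    )
    return result.returncode == 0, result.stdout + result.stderr


print(" ".join(xmlstarlet_command("my.xsd", "my.xml")))
```

Calling `validate_with_xmlstarlet("my.xsd", "my.xml")` then gives you the tool's messages as a string to log or parse further.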

There is also a testimonial on using XMLStarlet on very large files.

Licensed under: CC-BY-SA with attribution