Question

I need to connect to an external XML file (300MB+), download it, and process it, then run through the XML document and save elements to the database.

I am already doing this without a problem on a production server, using Saxerator to be gentle on memory. It works great. Here is my issue now:

I need to use open-uri (though there could be alternative solutions?) to grab the file to parse through. The problem is that open-uri has to load the whole file before anything starts parsing, which defeats the entire purpose of using a SAX parser to save memory... any workarounds? Can I just read from the external XML document directly? I cannot load the entire file or it crashes my server, and since the document is updated every 30 minutes, I can't just save a copy of it on my server (though this is what I am doing currently to make sure everything is working).

I am doing this in Ruby, p.s.
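
For reference, the current open-uri approach looks roughly like this sketch (the URL and element name are placeholders, not from the question):

require 'open-uri'
require 'saxerator'

# URI.open does not return until the entire 300MB+ body has been
# downloaded (buffered to memory or a tempfile), so the SAX pass
# cannot begin until the download finishes.
feed = URI.open('http://example.com/feed.xml')   # placeholder URL
Saxerator.parser(feed).for_tag(:item).each do |element|
  # save the element to the database
end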

Solution

You may want to try Net::HTTP's streaming interface instead of open-uri. That lets you hand Saxerator (via the underlying Nokogiri::XML::SAX::Parser) an IO object rather than the entire file.
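
Here is a minimal sketch of that approach, assuming a placeholder feed URL and that each record is wrapped in an <item> element (both are assumptions, not from the question). The download runs on a background thread and writes chunks into an IO.pipe; the read end of the pipe is the IO that Saxerator parses from.

require 'net/http'
require 'saxerator'

uri = URI('http://example.com/feed.xml')   # placeholder URL
reader, writer = IO.pipe

# Download on a background thread, writing each chunk into the pipe
# as it arrives instead of buffering the whole 300MB+ body.
downloader = Thread.new do
  begin
    Net::HTTP.start(uri.host, uri.port) do |http|
      http.request(Net::HTTP::Get.new(uri)) do |response|
        response.read_body { |chunk| writer.write(chunk) }
      end
    end
  ensure
    writer.close   # signals end-of-file to the parser
  end
end

# Saxerator accepts any IO, so parsing starts as soon as the first
# chunks hit the pipe and memory use stays flat.
Saxerator.parser(reader).for_tag(:item).each do |element|
  # save the element to the database here
end

downloader.join
reader.close

A nice property of IO.pipe here is built-in back-pressure: the writer blocks when the pipe's buffer fills, so the download never runs far ahead of the parser.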

OTHER TIPS

I took a few minutes to write this up and then realized you tagged this question with ruby. My solution is in Java so I apologize for that. I'm still including it here since it could be useful to you or someone down the road.

This is how I've always processed large external XML files:

import java.io.*;
import java.net.URL;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.*;
import org.xml.sax.helpers.XMLFilterImpl;

XMLReader xmlReader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
xmlReader.setFeature("http://xml.org/sax/features/namespaces", true);
// Attach your ContentHandler to the filter to receive parse events
XMLFilter filter = new XMLFilterImpl();
filter.setParent(xmlReader);
// The remote stream is consumed as it downloads, never fully buffered
filter.parse(new InputSource(new BufferedReader(new InputStreamReader(
        new URL("<url to external document here>").openConnection().getInputStream(), "UTF-8"))));