Question

I have some large XML files (~5 GB each) that I'm importing into a MongoDB database. I'm using Expat to parse the documents, doing some data manipulation (deleting some fields, unit conversion, etc.), and then inserting into the database. My script is based on this one: https://github.com/bgianfo/stackoverflow-mongodb/blob/master/so-import

My question is: is there a way to improve this with a batch insert? Would storing these documents in an array before inserting be a good idea? If so, how many documents should I store before inserting? Would writing the JSON documents to a file and then using mongoimport be faster?

I appreciate any suggestion.


Solution 2

Storing these documents on an array before inserting would be a good idea?

Yes, very likely. It reduces the number of round-trips to the database. If you monitor your system, you'll probably find it idles a lot when inserting single documents because of I/O wait; that is, the per-request overhead and thread synchronization take far more time than the actual data transfer.

How many documents should I store before inserting, then?

That's hard to say, because it depends on many factors. Rule of thumb: 1,000 - 10,000 documents per batch. You will have to experiment a little. In older versions of MongoDB, the entire batch must not be larger than the document size limit of 16 MB.
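A minimal sketch of the buffering approach: a helper that groups a document stream into fixed-size chunks, each of which can then be sent in a single `insert_many` call. The variable names (`parsed_documents`, the database and collection names) are assumptions, not from the original script.

```python
from itertools import islice

def batched(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

# Hypothetical pymongo usage (names are placeholders):
# from pymongo import MongoClient
# collection = MongoClient().mydb.articles
# for chunk in batched(parsed_documents, 5000):
#     # ordered=False lets the server keep going past individual failures
#     collection.insert_many(chunk, ordered=False)
```

Start with a batch size somewhere in the 1,000 - 10,000 range suggested above and measure; the sweet spot depends on document size and network latency.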

Writing the jsons into a file and then using mongoimport would be faster?

No, unless your code has a flaw. Writing to a file first means copying the data twice, and the entire operation should be I/O bound anyway.

Also, it's a good idea to insert all documents first and only then create any indexes, not the other way around (otherwise the index has to be updated on every insert).

OTHER TIPS

If you want to import XML into MongoDB and Python is simply the tool you've chosen so far, but you are open to other approaches, you could also do it with the following steps:

  1. transform the XML documents into CSV documents using XMLStarlet
  2. transform the CSVs into files containing JSON documents using AWK
  3. import the JSON files into MongoDB with mongoimport

XMLStarlet and AWK are both extremely fast, and this approach still lets you store your JSON objects with a non-trivial structure (sub-objects, arrays).


http://www.joyofdata.de/blog/transforming-xml-document-into-csv-using-xmlstarlet/
http://www.joyofdata.de/blog/import-csv-into-mongodb-with-awk-json/

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow