Question

Should an import of 80GB of XML data into MySQL take more than 5 days to complete?

I'm currently importing an XML file that is roughly 80GB in size; the code I'm using is in this gist. While everything is working properly, it's been running for almost 5 straight days and it's not even close to being done...

The average table size is roughly:

Data size: 4.5GB
Index size: 3.2GB
Avg. Row Length: 245
Number of Rows: 20,000,000

Let me know if more info is needed!

Server Specs:

Note: this is a Linode VPS

Intel Xeon Processor L5520 - Quad Core - 2.27GHz, 4GB Total RAM

XML Sample

https://gist.github.com/2510267

Thanks!


After researching this matter further, this seems to be about average; I found this answer, which describes ways to improve the import rate.


Solution

One thing which will help a great deal is to commit less frequently, rather than once per row. I would suggest starting with one commit per several hundred rows and tuning from there.
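As a minimal sketch of that batching pattern (the table name, columns, and `conn` object here are placeholders, not taken from the question's gist):

```python
# Commit once per batch instead of once per row.
# Table/column names and the connection object are illustrative only.

def chunked(rows, size):
    """Yield successive lists of up to `size` rows."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

def import_rows(conn, rows, batch_size=500):
    """Insert rows, committing once per batch rather than per row."""
    cur = conn.cursor()
    for batch in chunked(rows, batch_size):
        cur.executemany(
            "INSERT INTO my_table (col_a, col_b) VALUES (%s, %s)",
            batch,
        )
        conn.commit()  # one commit per batch_size rows
```

Start around a few hundred rows per commit and measure; the sweet spot depends on row size and your InnoDB settings.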

Also, the thing you're doing right now where you do an existence check -- dump that; it's greatly increasing the number of queries you need to run. Instead, use ON DUPLICATE KEY UPDATE (a MySQL extension, not standards-compliant) to make a duplicate INSERT automatically do the right thing.
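For example (the table and column names here are hypothetical; the real statement depends on your schema and which columns carry the unique key):

```python
# One round trip per row: MySQL inserts the row, or updates it when the
# primary/unique key already exists -- no SELECT existence check needed.
UPSERT_SQL = (
    "INSERT INTO my_table (id, name, value) "
    "VALUES (%s, %s, %s) "
    "ON DUPLICATE KEY UPDATE name = VALUES(name), value = VALUES(value)"
)

def upsert_row(cursor, row):
    """Insert-or-update a single (id, name, value) tuple."""
    cursor.execute(UPSERT_SQL, row)
```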

Finally, consider building your tool to convert from XML into a textual form suitable for use with the mysqlimport tool, and using that bulk loader instead. This will cleanly separate the time needed for XML parsing from the time needed for database ingestion, and also speed the database import itself by using tools designed for the purpose (rather than INSERT or UPDATE commands, mysqlimport uses a specialized LOAD DATA INFILE extension).
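A rough sketch of such a converter, assuming a flat `<record>` element with `<id>` and `<name>` children (the actual tag names come from your XML, and I'm using the standard library parser here for illustration):

```python
# Stream the XML once and emit a tab-separated file that
# mysqlimport / LOAD DATA INFILE can bulk-load.
import csv
import xml.etree.ElementTree as ET

def xml_to_tsv(xml_path, tsv_path):
    with open(tsv_path, "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t")
        for event, elem in ET.iterparse(xml_path, events=("end",)):
            if elem.tag == "record":
                writer.writerow([elem.findtext("id"), elem.findtext("name")])
                elem.clear()  # free parsed elements to keep memory flat
```

You would then point `mysqlimport` (with `--fields-terminated-by` matching the tab delimiter) at the resulting file; note that mysqlimport derives the target table name from the file name.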

Other tips

This is (probably) unrelated to your speed problem but I would suggest double checking whether the behaviour of iterparse fits with your logic. At the point the start event happens it may or may not have loaded the text value of the node (depending on whether or not that happened to fit within the chunk of data it parsed) and so you can get some rather random behaviour.

I have 3 quick suggestions to make without seeing your code, after attempting something similar:

  1. optimize your code for high performance; "High-performance XML parsing in Python with lxml" is a great article to look at.
  2. look into PyPy
  3. rewrite your code to take advantage of multiple CPUs, which Python will not do natively

Doing these things greatly improved the speed of a similar project I worked on. Perhaps if you had posted some code and example XML I could offer a more in-depth solution. (edit: sorry, missed the gist...)
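The multiprocessing suggestion (point 3) might be sketched like this; `handle_row` is a stand-in for whatever per-row transform or insert work the importer does:

```python
# Fan row tuples out to worker processes, since CPython threads won't
# parallelise CPU-bound work. handle_row is a placeholder work function.
from multiprocessing import Pool

def handle_row(row):
    # stand-in for real per-row work (parse, transform, insert)
    return sum(row)

if __name__ == "__main__":
    rows = [(1, 2), (3, 4), (5, 6)]
    with Pool(processes=4) as pool:
        results = pool.map(handle_row, rows)
    print(results)  # [3, 7, 11]
```

In a real importer you would typically keep the XML parsing in one process and have the workers own their own database connections.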

License: CC-BY-SA with attribution
Not affiliated with StackOverflow