Question

After recently experimenting with MongoDB, I tried a few different methods of importing/inserting large amounts of data into collections. So far the most efficient method I've found is mongoimport. It works perfectly, but there is still overhead. Even after the import is complete, memory isn't made available unless I reboot my machine.

Example:

mongoimport -d flightdata -c trajectory_data --type csv --file trjdata.csv --headerline

where my headerline and data look like:

'FID','ACID','FLIGHT_INDEX','ORIG_INDEX','ORIG_TIME','CUR_LAT', ...
'20..','J5','79977','79977','20110116:15:53:11','1967', ...

With 5.3 million rows by 20 columns, about 900MB, I end up like this:

[screenshot: memory usage after the import, captioned "Overhead"]

This won't work for me in the long run; I may not always be able to reboot, or I will eventually run out of memory. What would be a more effective way of importing into MongoDB? I've read about periodic RAM flushing; how could I implement something like that with the example above?
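
To make the question concrete, here is roughly the kind of batched import I have in mind, sketched with pymongo (which I haven't benchmarked; the database and collection names come from the example above, while the batch size and the per-batch fsync call are just placeholders for the "periodic RAM flushing" idea):

    # Rough sketch: import the CSV in batches instead of one mongoimport run.
    import csv
    from pymongo import MongoClient

    BATCH_SIZE = 10000  # placeholder; tune to taste

    client = MongoClient("localhost", 27017)
    coll = client["flightdata"]["trajectory_data"]

    with open("trjdata.csv", newline="") as f:
        reader = csv.DictReader(f)  # field names taken from the header line
        batch = []
        for row in reader:
            # note: all values arrive as strings here; convert types as needed
            batch.append(row)
            if len(batch) >= BATCH_SIZE:
                coll.insert_many(batch)        # one round trip per batch
                client.admin.command("fsync")  # flush pending writes between batches
                batch = []
        if batch:
            coll.insert_many(batch)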

Update: I don't think my case would benefit much from adjusting fsync, syncdelay, or journaling. I'm just curious about when that would be a good idea and what the best practice is, even if I were running on high-RAM servers.
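
For context, my understanding is that syncdelay can also be changed at runtime through setParameter rather than only at startup; a minimal pymongo sketch (the 120-second value is arbitrary):

    # Example only: change the interval between background flushes to disk.
    from pymongo import MongoClient

    client = MongoClient("localhost", 27017)
    client.admin.command("setParameter", 1, syncdelay=120)  # seconds; the default is 60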


Solution

I'm guessing that memory is being used by mongodb itself, not mongoimport. Mongodb by design tries to keep all of its data in memory and relies on the OS to swap the memory-mapped files out when there's not enough room. So I'd give you two pieces of advice:

  1. Don't worry too much about what your OS is telling you about how much memory is "free" -- a modern well-running OS will generally use every bit of RAM available for something.

  2. If you can't abide by #1, don't run mongodb on your laptop.
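
If you want to see how much memory mongod itself thinks it is using, rather than what the OS reports as "free", you can ask the server directly. A minimal pymongo sketch (the exact fields returned vary by version and storage engine):

    # Query mongod's own view of its memory usage via the serverStatus command.
    from pymongo import MongoClient

    client = MongoClient("localhost", 27017)
    mem = client.admin.command("serverStatus")["mem"]
    print(mem)  # typically includes 'resident' and 'virtual' (MB); 'mapped' on MMAPv1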

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow