Question

I have a large number of records (~1 billion) that I need to load into MongoDB (actually TokuMX, but whatever). I have about 6 different indices I need to create on the collection. Is it always faster to load the data and then create the indices? When I look at Mongo's logfile, it seems like Mongo does some kind of large operation (maybe a row count?) before actually starting index creation, and it does this for every index I create.

Will it always be faster to create the indices after loading the data?

If I wait until after loading the data, would it be faster to create all the indices in the background at the same time rather than creating them one by one?


Solution

Back in the day we would bulk load our data in this way:

  1. Drop indexes
  2. Load data in the order of the clustered index key (i.e., you export the data already sorted that way)
  3. After the load is completed, create the clustered index
  4. Next, create any additional non-clustered indexes
  5. Miller time (this was before I could afford decent beer)

That method always proved faster than leaving the indexes in place. However, this was for Sybase and SQL Server. I imagine other systems would be similar, but I can't say for certain.
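To make the steps concrete, here is a minimal sketch of the same drop, load, rebuild pattern. It uses Python's built-in sqlite3 module purely as a stand-in, not Sybase, SQL Server, or MongoDB, and the table `records`, the index names, and the sample rows are all made up for illustration. One difference to keep in mind: in SQLite an `INTEGER PRIMARY KEY` table is already stored in key order, so there is no separate "create the clustered index" step here; the sketch just loads the rows sorted by that key.

```python
import sqlite3

def bulk_load(conn, rows):
    """Load rows with the secondary indexes dropped, then rebuild them."""
    cur = conn.cursor()
    # Step 1: drop secondary indexes so the load does not maintain them row by row.
    cur.execute("DROP INDEX IF EXISTS idx_records_name")
    cur.execute("DROP INDEX IF EXISTS idx_records_created")
    # Step 2: insert the rows sorted by primary key (the storage order in SQLite).
    cur.executemany(
        "INSERT INTO records (id, name, created) VALUES (?, ?, ?)",
        sorted(rows),
    )
    # Steps 3-4: build each secondary index once, over the fully loaded table.
    cur.execute("CREATE INDEX idx_records_name ON records (name)")
    cur.execute("CREATE INDEX idx_records_created ON records (created)")
    conn.commit()

# Hypothetical schema and data, in memory only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records (id INTEGER PRIMARY KEY, name TEXT, created TEXT)"
)
conn.execute("CREATE INDEX idx_records_name ON records (name)")
conn.execute("CREATE INDEX idx_records_created ON records (created)")

rows = [
    (3, "carol", "2014-01-03"),
    (1, "alice", "2014-01-01"),
    (2, "bob", "2014-01-02"),
]
bulk_load(conn, rows)
```

Whether this ordering actually wins on a given engine and data set is something to measure on a sample first, since the sketch only shows the sequence of operations and says nothing about timing.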

OTHER TIPS

If you are doing a large load operation, it is faster to use the TokuMX bulk loader, as it requires only one pass over the data to create both the primary key index and any secondary indexes. More information is available in the documentation at http://docs.tokutek.com/tokumx/tokumx-commands.html#tokumx-new-commands-loader

Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange