Question

I'm trying to index 10 million documents into my Elasticsearch index using the Elastica API. I'm running my script on an Ubuntu server with 16 GB of RAM and 8 cores.

So far, I can't index more than 250,000 documents. My script breaks and returns an unknown error.

Can someone describe the steps needed to index this amount of data reliably?

I found a question similar to mine here, but the answers don't seem very clear to me.
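For reference, whichever client is used, indexing at this scale ultimately goes through Elasticsearch's bulk API, which Elastica's batch methods wrap. A minimal curl sketch of a single bulk request (the index and type names are placeholders, and this is not my actual script):

curl -XPOST 'localhost:9200/my_index/_bulk' --data-binary '
{"index": {"_index": "my_index", "_type": "my_type"}}
{"title": "document 1", "body": "..."}
{"index": {"_index": "my_index", "_type": "my_type"}}
{"title": "document 2", "body": "..."}
'

Sending the 10 million documents in bounded batches of a few thousand like this, rather than accumulating them into one giant request, is the usual way to keep the indexing client from running out of memory.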

EDIT:

I ran the indexing script that Nate suggested from here and got the following output:

close index
{
 "acknowledged" : true
}
refresh rate
{
 "acknowledged" : true
}
merge policy
{
 "acknowledged" : true
}
replicas
{
 "acknowledged" : true
}
flush
{
 "acknowledged" : true
}
buffer
{
 "acknowledged" : true
}
{
 "acknowledged" : true
}

PS: I slightly modified the script to make each step's output more visible.
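For reference, the script behind that output is along these lines. This is a sketch only, assuming Elasticsearch 1.x-era settings; my_index and all of the values are placeholders, not the exact script from the linked answer:

#!/bin/sh
# Bulk-load tuning sketch (ES 1.x-era settings; values are placeholders).

echo "close index"
curl -XPOST 'localhost:9200/my_index/_close'

echo "refresh rate"
# Disable refresh entirely for the duration of the load.
curl -XPUT 'localhost:9200/my_index/_settings' -d '{"index.refresh_interval": "-1"}'

echo "merge policy"
curl -XPUT 'localhost:9200/my_index/_settings' -d '{"index.merge.policy.merge_factor": 30}'

echo "replicas"
# No replicas while loading; restore them afterwards.
curl -XPUT 'localhost:9200/my_index/_settings' -d '{"index.number_of_replicas": 0}'

echo "flush"
curl -XPUT 'localhost:9200/my_index/_settings' -d '{"index.translog.disable_flush": true}'

echo "buffer"
curl -XPUT 'localhost:9200/my_index/_settings' -d '{"index.translog.flush_threshold_size": "1gb"}'

# Reopen the index once the settings are in place.
curl -XPOST 'localhost:9200/my_index/_open'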

EDIT 2: I have switched from Elastica to the elasticsearch-jdbc-river, and it now indexes around 5 million documents, but still not the whole database.

Here is the JSON file for the river, and the script that registers it with Elasticsearch is here.
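For reference, rivers in that era were registered by PUTting a _meta document into the _river index. A minimal sketch of a jdbc river definition (the connection details, SQL, and names are placeholders, not my actual configuration):

curl -XPUT 'localhost:9200/_river/my_jdbc_river/_meta' -d '{
  "type": "jdbc",
  "jdbc": {
    "url": "jdbc:mysql://localhost:3306/mydb",
    "user": "dbuser",
    "password": "dbpassword",
    "sql": "select * from documents",
    "index": "my_index",
    "type": "my_type"
  }
}'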


Solution

I resolved this issue a long time ago but forgot to write up the answer.

I went with the second approach, the elasticsearch-jdbc-river, which is deprecated as of the time I'm writing this answer.

The problem with the river back then was its default query_timeout option, which was not long enough for the heavy SQL query I was using. The river killed the process once the query_timeout elapsed.

I increased the query_timeout value, and that solved my problem.
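Concretely, that meant registering the river with a larger query_timeout in its jdbc block, something like the sketch below. The value and connection details are placeholders, and whether your river version reads the timeout in seconds is worth checking against its documentation:

# Same river definition as above, with query_timeout raised.
curl -XPUT 'localhost:9200/_river/my_jdbc_river/_meta' -d '{
  "type": "jdbc",
  "jdbc": {
    "url": "jdbc:mysql://localhost:3306/mydb",
    "user": "dbuser",
    "password": "dbpassword",
    "sql": "select * from documents",
    "query_timeout": 7200
  }
}'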

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow