Question

I am using ElasticSearch to index some data, but I found that the indexing performance is poor.

There are only 3000 entries, each with 6 fields, yet it takes 5 minutes to index them.

Because I am new to ElasticSearch, my code and program flow are basic:

  1. Search to check whether the same data already exists.
  2. If it exists, update it.
  3. If not, add it.

The code is as follows:

import pyes
conn = pyes.ES('server:9200')

Search:

searchResult = conn.search(searchDict, indexName, TypeName)

Index:

conn.index(storeDict, indexName, TypeName, id)

Update the Count field of the indexed document:

conn.partial_update(indexName, TypeName, id, "ctx._source.Count += counter", params={"counter" : 1})

Is there any way to improve the performance of my code?

Thank you for your help.

Solution

You don't need to search before updating. Read the Elasticsearch docs on the update API and scroll down to the upsert section. upsert is a parameter that holds a document to index if the document does not yet exist on the server; if the document does exist, the upsert is ignored and the request works like a normal update (as you are doing now).
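As a rough illustration of the upsert idea, the sketch below only builds the JSON body of such an update request; it does not talk to a server. It assumes the older request shape that matches the script in the question (top-level "script", "params" and "upsert" keys); newer Elasticsearch versions nest the script differently, so check the docs for your version. The helper name build_upsert_body and the sample field values are hypothetical.

```python
import json


def build_upsert_body(new_doc, increment=1):
    """Build an update-request body with an upsert fallback.

    Hypothetical helper: if the document already exists, the script
    increments its Count field; otherwise new_doc is indexed as-is.
    """
    return {
        # Same script as in the question (older, pre-"params." syntax).
        "script": "ctx._source.Count += counter",
        "params": {"counter": increment},
        # Used only when no document with the given id exists yet.
        "upsert": new_doc,
    }


# Sample document; the field names are made up for illustration.
body = build_upsert_body({"Name": "example", "Count": 1})
print(json.dumps(body, sort_keys=True, indent=2))
```

With a body like this, one request per entry replaces the search-then-index-or-update round trips described in the question.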

Good luck!

OTHER TIPS

  • You can use Elasticsearch's versioning feature. If you assign the document IDs yourself, this is easy: indexing a document under an existing ID simply re-indexes it.

  • You should use the bulk API for indexing (batches of 1000-5000 documents work well).

  • Another cause of poor performance can be the configuration settings in config/elasticsearch.yml; you can use these hints to increase indexing performance.
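To make the bulk tip above more concrete, here is a minimal sketch of how a bulk request body is laid out: one action line followed by one source line per document, newline-delimited, with a trailing newline. It only builds the payload and splits the data into batches; sending it (for example via a client library or an HTTP POST to the _bulk endpoint) is left out. The helper names, index name, type name and sample data are hypothetical, and the "_type" field applies to older Elasticsearch versions like the one in the question; recent versions no longer use it.

```python
import json


def chunked(items, size):
    """Yield successive batches of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_bulk_body(docs, index_name, type_name):
    """Build a newline-delimited bulk body from (doc_id, doc) pairs."""
    lines = []
    for doc_id, doc in docs:
        # Action line: says what to do and where.
        action = {"index": {"_index": index_name,
                            "_type": type_name,
                            "_id": doc_id}}
        lines.append(json.dumps(action))
        # Source line: the document itself.
        lines.append(json.dumps(doc))
    # The bulk format requires a newline after the last line too.
    return "\n".join(lines) + "\n"


# 3000 made-up entries, sent as three batches of 1000.
entries = [(str(i), {"Name": "entry-%d" % i, "Count": 1}) for i in range(3000)]
for batch in chunked(entries, 1000):
    payload = build_bulk_body(batch, "myindex", "mytype")
    # payload would be sent to the bulk endpoint here.
    print(len(payload.splitlines()), "lines in this bulk request")
```

Batching this way turns 3000 individual requests into a handful, which is usually where most of the time is saved.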

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow