Question

I am using ElasticSearch to index some data, but I found that the performance is poor.

There are only 3000 entries, each with 6 columns, yet it takes 5 minutes to index them.

Because I am new to ElasticSearch, my code and program flow are basic:

  1. Search to check whether the same data already exists.
  2. If it does, update it.
  3. If not, add it.

The code is as follows:

import pyes

conn = pyes.ES('server:9200')

Search:

searchResult = conn.search(searchDict, indexName, TypeName)

Index:

conn.index(storeDict, indexName, TypeName, id)

Update the Count field in the indexed document:

conn.partial_update(indexName, TypeName, id, "ctx._source.Count += counter", params={"counter" : 1})

Is there any way to improve the performance of my code?

Thank you for your help.


Solution

You don't need to search before updating, which saves one round trip per entry. Read the Elasticsearch docs on updating and scroll down to the upsert section. upsert is a parameter that holds the document to index if the document does not yet exist on the server; if it does exist, the upsert is ignored and the request works like a normal update (as you are doing now).
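As a rough sketch of what such a request body could look like, the snippet below builds an update payload with an upsert document. It only constructs the JSON and does not talk to a server. The helper name `build_upsert_body` and the initial value `{"Count": 1}` are assumptions made for illustration. The top-level `script`/`params` layout matches the style of script used in the question (older Elasticsearch releases); newer releases nest the script source and params inside a `script` object, so check the docs for your version.

```python
import json


def build_upsert_body(script, params, upsert_doc):
    """Build an update request body that carries an upsert document.

    Hypothetical helper: the layout follows older Elasticsearch releases,
    where "script" and "params" sit at the top level of the body.
    """
    return {
        "script": script,       # applied when the document already exists
        "params": params,       # values referenced by the script
        "upsert": upsert_doc,   # indexed as-is when the document is missing
    }


body = build_upsert_body(
    script="ctx._source.Count += counter",  # same script as in the question
    params={"counter": 1},
    upsert_doc={"Count": 1},  # assumed starting value for a new entry
)

# Serialized form that would be sent as the request body.
payload = json.dumps(body, sort_keys=True)
print(payload)
```

With a body like this, the search step and the separate index call from the question collapse into a single update request per entry. Whether your pyes version lets you pass the upsert document through `partial_update` is something to verify against its documentation.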

Good luck!

Other tips

  • You can use the versioning feature of Elasticsearch. If you assign the document IDs yourself, this is pretty easy: indexing with an existing ID simply re-indexes the document.

  • You should use the Bulk API for indexing (batches of 1000-5000 documents work well).

  • Another cause of poor performance can be the configuration settings in config/elasticsearch.yml; you can use these hints to increase indexing performance.
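To make the Bulk API tip more concrete, here is a minimal sketch of how the 3000 entries could be grouped into batches and turned into bulk request bodies (newline-delimited JSON, one action line followed by one source line per document). It only builds the bodies and sends nothing. The helper names, the index and type names (`my_index`, `my_type`) and the sample documents are all assumptions for illustration; the `_type` field reflects the typed indices used in the question and is not accepted by recent Elasticsearch versions.

```python
import json


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_bulk_body(docs, index_name, type_name):
    """Build a newline-delimited bulk body from (doc_id, source) pairs."""
    lines = []
    for doc_id, source in docs:
        # Action line: tells the server what to do with the next line.
        action = {"index": {"_index": index_name, "_type": type_name, "_id": doc_id}}
        lines.append(json.dumps(action))
        # Source line: the document itself.
        lines.append(json.dumps(source))
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"


# Hypothetical sample data standing in for the 3000 entries.
docs = [(str(i), {"Count": 1, "name": "entry-%d" % i}) for i in range(3000)]

# One request body per batch of 1000 documents.
bodies = [build_bulk_body(batch, "my_index", "my_type") for batch in chunked(docs, 1000)]
print(len(bodies))
```

Each body would then be sent in a single request to the `_bulk` endpoint, so the 3000 entries need a handful of requests instead of several thousand.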

Licensed under: CC-BY-SA with attribution