Can you run the IndexPerfES.sh script against the index you are bulk indexing to? We can then see if performance improves. I think the refresh rate is degrading performance and is perhaps stressing the cluster, leading to problems. Let me know and we can work this out.
Find ES bottleneck in bulk (bigdesk screenshots attached)
26-06-2023
Question
Updated: beware, long post ahead.
Before I go big and move to a bigger server I want to understand what's wrong with this one.
This is an MVP of an Elasticsearch server on AWS (EC2): two t1.micros with just 600 MB of RAM each.
I want to understand what's wrong with this configuration. As you can see, there's a bottleneck in the bulk command. The OS memory is quite full, heap memory is still low, and although the process CPU is running at maximum, the OS CPU is low.
I reduced the complexity of each document in the bulk feed and set unwanted fields to not be indexed. The screenshots below are from my last attempt.
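For reference, disabling indexing of unused fields is done in the mapping. A minimal sketch of such a mapping (the field names are hypothetical, and `"index": "no"` is the pre-5.x syntax that matches clusters of this era; newer versions use `"index": false`):

```python
import json

def mapping_with_unindexed_fields():
    """Hypothetical mapping: "raw_payload" is stored but not indexed,
    so the bulk path skips analyzing and inverting it."""
    return {
        "properties": {
            "title": {"type": "string"},        # searchable
            "raw_payload": {
                "type": "string",
                "index": "no",                  # not indexed (pre-5.x syntax)
            },
        }
    }

# PUT this under the type mapping of the index you bulk into
print(json.dumps(mapping_with_unindexed_fields(), indent=2))
```

Every field you exclude from indexing is analysis work the node no longer has to do per document.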
Is it an I/O bottleneck? I store the data in an S3 bucket.
Server Info:
2 nodes (one on each server) and 3 indexes, each running with 2 shards and 1 replica. So there's a primary node with a running backup. Strangely, the "Iron Man" node never took over a shard.
I ran the feeder again with the cluster in the above state, and the bottleneck appears on both nodes:
Here is the beginning of the feeder:
Primary:
Secondary (secondary has the bottleneck):
After 5 minutes of feeding:
Primary (now the primary has the bottleneck):
Secondary (the secondary is now better):
I'm using py-elasticsearch, so requests are auto-throttled in the streamer. However, after the big bottleneck below, it threw this error:
elasticsearch.exceptions.ConnectionError:
ConnectionError(HTTPConnectionPool(host='IP_HERE', port=9200):
Read timed out. (read timeout=10)) caused by:
ReadTimeoutError(HTTPConnectionPool(host='IP_HERE', port=9200):
Read timed out. (read timeout=10))
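A client-side mitigation for this kind of read timeout is to retry with exponential backoff instead of failing on the first attempt (and/or raise the client's timeout above 10 s). A generic sketch, where `send_bulk` is just a flaky stand-in for the actual py-elasticsearch bulk call:

```python
import time

def retry_with_backoff(fn, retries=3, base_delay=1.0):
    """Call fn(); on failure, wait base_delay * 2**attempt and try again."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise          # out of retries, propagate the error
            time.sleep(base_delay * 2 ** attempt)

# Stand-in that fails twice before succeeding, like a briefly overloaded node:
calls = {"n": 0}
def send_bulk():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("Read timed out. (read timeout=10)")
    return "ok"

print(retry_with_backoff(send_bulk, base_delay=0.01))  # ok
```

The backoff gives a saturated bulk queue time to drain instead of piling on more requests while the node is already behind.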
And here below is a very interesting screenshot of the same bulk feed. The queue reached 20, Python threw the exception above, and the refresh command is still running as I write this.
My objective is to understand which resource (CPU, RAM, disk, network...) is inadequate, or better yet, how to use the existing resources more efficiently.
Solution 2
OTHER TIPS
So Nate's script was (among other things) reducing the refresh interval. Let me add some other findings as well:
The refresh rate was stressing the cluster; however, I kept searching and found more "errors". One gotcha was that I had the deprecated S3 gateway. S3 is persistent but slower than an EC2 volume.
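Lowering or disabling the refresh interval during a bulk load is a one-line settings call. A minimal sketch of the payload (the index name and HTTP client are up to you; `"-1"` disables refresh entirely, so remember to restore a normal value such as `"1s"` afterwards):

```python
import json

def refresh_settings_body(interval="-1"):
    """Build the _settings payload that disables (or slows) index refresh.

    "-1" turns refresh off for the duration of the bulk load;
    set it back (e.g. "1s") once feeding is done.
    """
    return json.dumps({"index": {"refresh_interval": interval}})

# PUT http://IP_HERE:9200/<index>/_settings with this body
print(refresh_settings_body())       # {"index": {"refresh_interval": "-1"}}
print(refresh_settings_body("30s"))  # restore a slower-than-default refresh
```

With refresh off, the node stops reopening index readers every second mid-bulk, which is exactly the stress Nate suspected.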
Not only did I have S3 as data storage, it was also in a different region (EC2 in Virginia -> S3 in Oregon), so documents were being sent across regions over the network. I ended up with that setup because some old tutorials list S3 as a cloud data storage option.
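Switching off the S3 gateway is a config change. A sketch of the relevant elasticsearch.yml lines for the old (0.x/1.x) versions where the S3 gateway existed; the data path is a placeholder:

```yaml
# elasticsearch.yml -- keep data on the local EC2 volume,
# not the deprecated (and cross-region, in my case) S3 gateway
gateway.type: local                   # the default; replaces "s3"
path.data: /var/data/elasticsearch    # placeholder path on the instance's disk
```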
After fixing that, the "Documents deleted" figure below was better. When I was using S3 it was around 30%. This is from the Elasticsearch HQ plugin.
Now that I/O is optimized, let's see what else we can do.
I also found that CPU is an issue. Although BigDesk shows the workload as minimal, t1.micros are not meant for sustained CPU usage. Even though the charts show the CPU as not fully used, that's because Amazon throttles it in intervals; in reality, it is fully used.
If you feed it bigger, more complex documents, it will stress the server.
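On burstable instances like t1.micro, one mitigation is smaller bulk batches with a pause between them, so the throttled CPU gets a chance to catch up. A generic chunking sketch (the chunk size and pause are starting points to tune, not measured values; `send` stands in for the py-elasticsearch bulk call):

```python
import time

def chunked(docs, size):
    """Yield docs in lists of at most `size` items."""
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

def feed(docs, send, chunk_size=100, pause=0.5):
    """Send docs in small bulks, pausing so a throttled CPU can recover."""
    for batch in chunked(docs, chunk_size):
        send(batch)          # stand-in for the real bulk request
        time.sleep(pause)

# Usage with a stand-in sender that just records the batches:
sent = []
feed(range(250), sent.append, chunk_size=100, pause=0)
print([len(b) for b in sent])  # [100, 100, 50]
```

Smaller bulks also keep each request under the 10 s read timeout that bit me above.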
Happy dev-oping.