Question

I am laying out an architecture where we will be using statsd and graphite. I understand how graphite works and how a single statsd server could communicate with it. I am wondering how the architecture and setup would work for scaling out statsd servers. Would you have multiple statsd servers on individual nodes and then one central statsd server pushing to graphite? I couldn't seem to find anything about scaling out statsd, and any ideas on how to run multiple statsd servers would be appreciated.

Solution

I'm dealing with the same problem right now. Naive load balancing between multiple statsd instances obviously doesn't work, because keys with the same name would end up in different instances and would thus be aggregated incorrectly.

But there are a couple of options for using statsd in an environment that needs to scale:

  • use client-side sampling for counter metrics, as described in the statsd documentation (i.e. instead of sending every event to statsd, send only every 10th event and have statsd multiply the count by 10). The downside is that you need to set an appropriate sampling rate for each of your metrics manually. Sample too little and your results will be inaccurate; sample too much and you'll overwhelm your (single) statsd instance. (A wire-format sketch follows this list.)

  • build a custom load balancer that shards by metric name across the statsd instances, thus circumventing the problem of broken aggregation. Each instance could then write directly to Graphite. (A sharding sketch follows this list.)

  • build a statsd client that counts events locally and only sends them to statsd in aggregate. This greatly reduces the traffic going to statsd and also makes it constant (as long as you don't add more servers). As long as the interval at which you send the data to statsd is much smaller than statsd's own flush interval, you should also get similarly accurate results. (A client sketch follows this list.)

  • a variation of the previous point, which I have implemented with great success in production: use a first layer of multiple (in my case local) statsd instances, which in turn all aggregate into one central statsd, which then talks to Graphite. The first layer of statsd instances needs a much shorter flush interval than the second. To do this, you will need a statsd-to-statsd backend. Since I faced exactly this problem, I wrote one that tries to be as network-efficient as possible: https://github.com/juliusv/ne-statsd-backend (A config sketch follows this list.)
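
For the sampling option, the statsd wire format already carries the sample rate, so a client only has to send a fraction of events and tag them with `|@rate`. A minimal Python sketch; the host, port, and metric name are illustrative:

```python
import random
import socket

STATSD_ADDR = ("statsd.example.com", 8125)  # assumed host and port
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def incr_sampled(metric, rate=0.1):
    # Send only ~rate of all events; the "|@rate" suffix tells statsd
    # to scale the counter back up by 1/rate when it aggregates.
    if random.random() < rate:
        sock.sendto(f"{metric}:1|c|@{rate}".encode(), STATSD_ADDR)

incr_sampled("page.views")
```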
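
For the sharding load balancer, the key idea is to hash only the metric name when choosing a backend, so every packet for a given key reaches the same statsd instance. A minimal sketch, assuming a fixed pool of instances (hostnames and ports are made up); note that plain modulo hashing reshuffles keys whenever the pool changes, so a real implementation would likely use consistent hashing:

```python
import hashlib
import socket

# Assumed pool of statsd instances; hostnames and ports are illustrative.
STATSD_POOL = [("statsd-1", 8125), ("statsd-2", 8125), ("statsd-3", 8125)]
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def shard_for(metric):
    # Hash the metric name only, so every packet for a given key
    # always lands on the same statsd and aggregation stays correct.
    digest = hashlib.md5(metric.encode()).digest()
    return STATSD_POOL[int.from_bytes(digest[:4], "big") % len(STATSD_POOL)]

def send_counter(metric, value=1):
    sock.sendto(f"{metric}:{value}|c".encode(), shard_for(metric))

send_counter("page.views")
```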
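
For the locally aggregating client, a sketch of the idea: increment an in-process counter on every event and flush one packet per metric on a short timer (one second here, well under statsd's default 10-second flush). All names and intervals are illustrative:

```python
import socket
import threading
from collections import Counter

STATSD_ADDR = ("localhost", 8125)  # assumed
FLUSH_SECONDS = 1.0  # must stay well below statsd's own flush interval

class AggregatingClient:
    def __init__(self):
        self._counts = Counter()
        self._lock = threading.Lock()
        self._sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        self._schedule()

    def incr(self, metric, n=1):
        # Cheap in-process increment; no network traffic here.
        with self._lock:
            self._counts[metric] += n

    def _schedule(self):
        timer = threading.Timer(FLUSH_SECONDS, self._flush)
        timer.daemon = True
        timer.start()

    def _flush(self):
        with self._lock:
            counts, self._counts = self._counts, Counter()
        for metric, n in counts.items():
            # One packet per metric per flush, regardless of event volume.
            self._sock.sendto(f"{metric}:{n}|c".encode(), STATSD_ADDR)
        self._schedule()

client = AggregatingClient()
client.incr("page.views")
```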
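
For the two-layer setup, each first-layer statsd would load the statsd-to-statsd backend instead of the Graphite one and flush on a short interval. The sketch below uses statsd's standard config.js shape; the module name in `backends` and the upstream settings are assumptions to be checked against the backend's README:

```js
// First-layer (local) statsd config: aggregate briefly, then forward
// upstream instead of writing to Graphite directly.
{
  port: 8125,
  flushInterval: 1000,             // 1 s; the central statsd flushes far less often
  backends: ["ne-statsd-backend"]  // assumed module name; see the repo's README
  // Backend-specific settings (the central statsd's host/port) go here,
  // under whatever keys the backend defines.
}
```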

As it is, statsd was unfortunately not designed to scale in a manageable way (no, I don't see adjusting sampling rates manually as "manageable"). But the workarounds above should help if you are stuck with it.

Other tips

Most of the implementations I saw use per-server metrics, like: <env>.applications.<app>.<server>.<metric>

With this approach you can run a local statsd instance on each box, keep the UDP traffic on-box, and let each statsd publish its aggregates to Graphite (see the sketch below).
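
A minimal sketch of what that looks like from the application side, assuming a statsd listening on each box at 127.0.0.1:8125 (the env, app, and metric names are illustrative):

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

ENV, APP = "prod", "myapp"                   # illustrative values
SERVER = socket.gethostname().split(".")[0]  # e.g. "web01"

def incr(metric):
    # UDP to the statsd on this same box; it aggregates locally and
    # publishes to Graphite on its own flush schedule.
    name = f"{ENV}.applications.{APP}.{SERVER}.{metric}"
    sock.sendto(f"{name}:1|c".encode(), ("127.0.0.1", 8125))

incr("requests")
```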

If you don't really need per-server metrics, you have two choices:

  1. Combine related metrics in the visualization layer (e.g. by configuring Graphiti to do so)
  2. Use carbon aggregation to take care of it (see the example rule after this list)
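
For option 2, carbon-aggregator is driven by aggregation-rules.conf. A sketch of a rule that collapses the per-server scheme above into a per-app total (the metric layout and the 60-second frequency are illustrative):

```
# aggregation-rules.conf: output_template (frequency) = method input_pattern
<env>.applications.<app>.all.requests (60) = sum <env>.applications.<app>.*.requests
```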

If you have access to a hardware load balancer like an F5 BigIP (I'd imagine there are OSS software implementations that do this) and happen to have each host's hostname in your metrics (i.e. you're counting things like "appname.servername.foo.bar.baz" and aggregating them at the Graphite level), you can use source-address affinity load balancing: it sends all traffic from one source address to the same destination node (within a reasonable timeout). So, as long as each metric name comes from only one source host, this will achieve the desired result.
