Naming statsd metrics for short lived streams

Question

What guidance would you give on naming the metrics?

Graphite recommends that "Volatile path components should be kept as deep into the hierarchy as possible". This essentially means that if you can push the parts of the metrics that are frequently unique to the end of the "bucket" without impacting your grouping queries you should try to do so.

Here is a great post on using Graphite that includes naming recommendations. And here is another one with additional info from Jason Dixon (an excellent source for Graphite stuff in general).

Is it kosher to have metrics that have "dynamic" identifiers as part of the name?

I usually try to avoid identifiers in the metric names unless they are very low in number (<100). Because Graphite will store a .wsp file for every metric name you'll have a difficult time re-sizing or adjusting the storage settings should you decide to change your configuration. Additionally, the Graphite UI will have a "folder" for every metric name so you can easily make the UI unusable.

In your case, I'd probably graph the total number of game instances, the total number of players, and the number of errors (by type), etc. Additionally, I might try to track players per instance (generally) and maybe errors per instance (again without knowing the actual instance. e.g. GameTitle.RealTime.PerInstance.VoiceErrors) if I had that capability (i.e. state stored per instance in my application).

Logstash, Elastic Search, Kibana

I'd suggest logging this error information with instance and player ids and using logstash to send your logs to elastic search and kibana. Then I'd watch Graphite for real time error and health anomaly detection and use Kibana (and Elastic Search underneath) to dig deeper.

What are the scale limits on this. If I had a 100K game instances with say as many as 1000 players in a game, is this going to kill statsd/graphite?

Statsd should have no problem with this, as it just acts as a -mostly- dumb aggregator. While it does maintain some state internally I don't anticipate a problem.

I don't think you'll have problems with the internal Graphite Whisper Storage itself, as it is just using files and folders. But, as I mentioned above, the Graphite Web UI will be unusable and I think you'll also run the risk of other manageability issues.

Summary

Keep the volatile (dynamic) metric buckets at the end of the name and avoid going above a couple hundred of these.