Question

We use Postgres for analytics (star schema). Every few seconds we get reports on ~500 metric types. The simplest schema would be:

timestamp      metric_type     value
78930890       FOO              80.9
78930890       ZOO              20
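
In DDL terms that would be something like this (the column types are just my guess):

CREATE TABLE metrics (
    timestamp   INTEGER,
    metric_type TEXT,
    value       DOUBLE PRECISION
);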

Our DBA has come up with a suggestion to flatten all reports within the same 5-second window into:

timestamp   metric1     metric2     ...  metric500
78930890    80.9        20          ...

Some developers push back on this, saying it adds significant complexity to development (batching data so it is written in one shot) and to maintainability (just looking at the table or adding fields becomes more complex).

Is the DBA's model standard practice in such systems, or only a last resort once the original model is clearly not scalable enough?

EDIT: the end goal is to draw a line chart for the users. So queries will mostly select a few metrics, fold them by hour, and compute min/max/avg per hour (or any other time period).
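
For example, a typical chart query against the simple schema would look roughly like this (hour bucketing over an epoch-second timestamp; the exact form is only illustrative):

SELECT
    metric_type,
    (timestamp / 3600) * 3600 AS hour_bucket,   -- fold epoch seconds into hours
    min(value) AS min_value,
    max(value) AS max_value,
    avg(value) AS avg_value
FROM metrics
WHERE metric_type IN ('FOO', 'ZOO')
GROUP BY metric_type, hour_bucket
ORDER BY hour_bucket;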

EDIT: the DBA's arguments are:

  1. This is relevant from day 1 (see below), but even if it were not, it is something the system will eventually need to do, and migrating from another schema will be a pain

  2. Reducing the number of rows by a factor of up to 500 will allow more efficient indexes and memory use (the table will contain hundreds of millions of rows before this optimization)

  3. When selecting multiple metrics, the suggested schema will allow one pass over the data instead of a separate query for each metric (or some complex combination of OR and GROUP BY); see the sketch below
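
For illustration, the "one pass" query the DBA has in mind over the flattened table would be roughly the following (table and column names are hypothetical):

SELECT
    (timestamp / 3600) * 3600 AS hour_bucket,
    min(metric1) AS min_metric1, max(metric1) AS max_metric1, avg(metric1) AS avg_metric1,
    min(metric2) AS min_metric2, max(metric2) AS max_metric2, avg(metric2) AS avg_metric2
FROM metrics_flat
GROUP BY hour_bucket
ORDER BY hour_bucket;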

EDIT: 500 metrics is an "upper bound", but in practice only ~40 metrics are typically reported per 5-second window (not the same 40 each time, though)


Solution

The DBA's suggestion isn't totally unreasonable if the metrics are fairly fixed and make sense to group together. A few problems you'll likely face, though:

  • Postgres has a limit of between 250 and 1,600 columns (depending on data type)
  • The table will be hard for developers to work with, especially if you often want to query for only a subset of the attributes
  • Adding new columns will be slow

Instead, you might want to consider using an HSTORE column:

CREATE EXTENSION IF NOT EXISTS hstore;

CREATE TABLE metrics (
    timestamp INTEGER,
    "values"  HSTORE    -- "values" is a reserved word, so it has to be quoted
);

This will give you some flexibility in storing attributes, and allows for indices. For example, to index just one of the metrics:

CREATE INDEX metrics_metric3 ON metrics (("values"->'metric3'));
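
A hypothetical insert and hourly roll-up against this table could look like the following sketch (metric names are made up; the text values are cast to numeric for aggregation):

INSERT INTO metrics (timestamp, "values")
VALUES (78930890, 'metric3=>80.9, metric7=>20');

SELECT
    (timestamp / 3600) * 3600           AS hour_bucket,
    min(("values"->'metric3')::numeric) AS min_metric3,
    max(("values"->'metric3')::numeric) AS max_metric3,
    avg(("values"->'metric3')::numeric) AS avg_metric3
FROM metrics
WHERE "values" ? 'metric3'              -- only rows that actually report metric3
GROUP BY hour_bucket
ORDER BY hour_bucket;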

One drawback of this is that hstore values can only be stored as text strings (hence the casts above), so if you want to treat the metrics as proper numbers, a JSON column might also be worth considering:

CREATE TABLE metrics (
    timestamp INTEGER,
    "values"  JSON
);

-- json has no built-in comparison operators, so index the value extracted
-- as text (->>) and cast to a number, rather than the raw json value
CREATE INDEX metrics_metric3 ON metrics ((("values"->>'metric3')::numeric));
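
Querying the JSON variant for a chart is then essentially the same roll-up, just with the JSON operators (again only a sketch, with made-up metric names):

INSERT INTO metrics (timestamp, "values")
VALUES (78930890, '{"metric3": 80.9, "metric7": 20}');

SELECT
    (timestamp / 3600) * 3600            AS hour_bucket,
    min(("values"->>'metric3')::numeric) AS min_metric3,
    max(("values"->>'metric3')::numeric) AS max_metric3,
    avg(("values"->>'metric3')::numeric) AS avg_metric3
FROM metrics
WHERE ("values"->>'metric3') IS NOT NULL -- skip rows that don't report metric3
GROUP BY hour_bucket
ORDER BY hour_bucket;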

The drawback here is that you'll need to use Postgres 9.3, which is still reasonably new.

Licensed under: CC-BY-SA with attribution