Performance of rare SELECT vs. frequent INSERT in timeseries data
29-09-2020
Question
I have a simple timeseries table:

CREATE TABLE movement_history (
    data_id     serial,
    item_id     character varying(8),
    event_time  timestamp without time zone,
    location_id character varying(7),
    area_id     character varying(2)
);
My frontend developer tells me the cost of finding where an item was at a given timestamp is too high, because he has to sort the table. He wants me to add a second timestamp column holding the time of the next event so that he doesn't have to sort. But that would more than double the cost of inserting a new movement: I would have to query for the item's previous entry, update it, and then insert the new row.
My inserts far outnumber his queries in frequency. And I have never seen a timeseries table that included a column for the time of the next event. He says my table is broken because his infrequent query requires a sort. Any suggestions?
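For illustration, the write path that the suggested extra column would force on every insert might look roughly like this (the next_event_time column and the literal values are hypothetical):

```sql
BEGIN;
-- 1. Close out the item's previous "open" row: the extra cost.
UPDATE movement_history
SET    next_event_time = '2015-01-15 10:00'
WHERE  item_id = 'H665AYG3'
AND    next_event_time IS NULL;

-- 2. The insert that used to be the whole job.
INSERT INTO movement_history (item_id, event_time, location_id, area_id)
VALUES ('H665AYG3', '2015-01-15 10:00', 'LOC0001', 'A1');
COMMIT;
```

The UPDATE has to locate the previous row first, so every movement now costs a lookup, a (dead-tuple-producing) update, and an insert.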
I don't know what query he is using but I would be doing this:
select * from movement_history
where event_time <= '1-15-2015'::timestamp
and item_id = 'H665AYG3'
order by event_time desc limit 1;
We currently have about 15K items, each entered into the database at most once a day. However, we will soon have 50K items with sensor data that updates every 1 to 5 minutes.
I don't expect his query to run very often, but another query to get the current status of the pallets will:
select distinct on (item_id) *
from movement_history
order by item_id, event_time desc;
This server is currently running 9.3 but it could be running on 9.4 if it needs to.
Solution
Create an index on (item_id, event_time).
The query will jump to the specified item_id, jump to the specified event_time within that item_id, and then move back one row. No sorting involved.
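A minimal sketch of that index and the lookup it serves (the index name is my choice):

```sql
CREATE INDEX movement_history_item_time_idx
    ON movement_history (item_id, event_time);

-- Answered by a single backward index scan, no sort step:
SELECT *
FROM   movement_history
WHERE  item_id = 'H665AYG3'
AND    event_time <= '2015-01-15'::timestamp
ORDER  BY event_time DESC
LIMIT  1;
```

Because the ORDER BY matches the index order for the given item_id, the planner can satisfy the LIMIT 1 by reading one index entry backward from the upper bound.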
OTHER TIPS
Conflicting solutions
You would need a multicolumn index like @jjanes provided. While you're at it, you could make (item_id, event_time) the primary key to provide that index automatically.
But this conflicts with write performance, as @Michael explained: you double the cost for "50K of items ... updated every 1 to 5 minutes" just to make occasional SELECT queries cheaper. That's about 1 million rows per hour.
Partitioning
If you don't have more conflicting requirements, the compromise could be partitioning where the current partition does not have an index, yet. This way you get top write performance and (almost) top read performance.
The parent table could be movement_history, the current partition movement_history_current. No indexes on the current partition, only one CHECK constraint to allow constraint exclusion. Partitions could be daily by default, but the time intervals can be anything; they don't even have to be regular. We can work with that and start a new partition whenever we need to.
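A sketch of that layout using the inheritance-based partitioning available in Postgres 9.3 (the table names follow the text; the constraint bound is an example):

```sql
-- Parent table: never holds rows itself.
CREATE TABLE movement_history (
    data_id     serial,
    item_id     varchar(8),
    event_time  timestamp,
    location_id varchar(7),
    area_id     varchar(2)
);

-- Current partition: no index, just one CHECK constraint so the
-- planner can skip it via constraint exclusion where possible.
CREATE TABLE movement_history_current (
    CHECK (event_time >= '2015-01-15')   -- open-ended upper bound
) INHERITS (movement_history);

-- Writers target the current partition directly, so inserts pay
-- no index maintenance at all:
INSERT INTO movement_history_current (item_id, event_time, location_id, area_id)
VALUES ('H665AYG3', '2015-01-15 10:00', 'LOC0001', 'A1');
```

Queries against the parent movement_history see all partitions; constraint_exclusion (on by default for partitioned setups) prunes partitions whose CHECK constraint contradicts the WHERE clause.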
When you need to include current data in said query, do the following:
1. To start a new partition, in one transaction:
- Rename the current partition by appending something to the name, like movement_history_20150110_20150115 (or more specific), and adjust its constraint on event_time.
- Create a new partition under the ever-same name movement_history_current, with a constraint on event_time that does not overlap the previous one and has an open end.
- Depending on your access patterns, you may have to deal with concurrent write access.
2. Add a PK on (item_id, event_time) to the new historic partition. Not in the same transaction: creating the index in one piece is much cheaper than adding to it incrementally.
2a. To integrate advice for your second query below: refresh the materialized view mv_last_movement.
3. Run your query. Actually, you can run the query any time; if it includes the current partition, or any partition that doesn't have its index yet, it is just slower for those partitions.
4. Archive the oldest partitions from time to time. Just back up and drop the table. That hardly interferes with ongoing operation, which is the beauty of partitioning.
Read the manual first. There are caveats for inheritance and partitioning.
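The rotation steps above, sketched for one changeover (the boundary timestamps and constraint names are examples):

```sql
BEGIN;
-- 1. Rename the current partition and close its time range.
ALTER TABLE movement_history_current
      RENAME TO movement_history_20150110_20150115;
ALTER TABLE movement_history_20150110_20150115
      ADD CONSTRAINT mh_20150110_20150115_upper
      CHECK (event_time < '2015-01-15');   -- now bounded on both ends

-- 2. Recreate the current partition under the same name, open-ended, no index.
CREATE TABLE movement_history_current (
    CHECK (event_time >= '2015-01-15')
) INHERITS (movement_history);
COMMIT;

-- Separately, NOT in the same transaction: index the frozen partition
-- in one piece, which is far cheaper than incremental maintenance.
ALTER TABLE movement_history_20150110_20150115
      ADD PRIMARY KEY (item_id, event_time);
```

Writers that target movement_history_current by name pick up the new table as soon as the transaction commits.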
Your second query
The second query you added in an edit is the far bigger issue for performance. I am talking orders of magnitude:
select distinct on (item_id) * from movement_history order by item_id, event_time desc;
Once you start inserting 1 million rows per hour, performance for this query will deteriorate quickly. You are dealing with many, many rows per item, and DISTINCT ON is only good for few rows per item. Detailed explanation of DISTINCT ON and faster alternatives:
I still suggest partitioning like in my first answer. But enforce a new partition in reasonable intervals, so the current partition does not get too big.
In addition, create a "materialized view" tracking the latest state for each item. It's not a standard MATERIALIZED VIEW, because the defining query has a self-reference. I name it mv_last_movement; it has the same row type as movement_history.
Refresh whenever a new partition starts (see above).
Assuming the existence of an item
table:
CREATE TABLE item (
item_id varchar(8) PRIMARY KEY -- should really be a serial
-- more columns?
);
If you don't have one, create it. Or use the alternative recursive CTE technique outlined in the answer linked above.
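A rough sketch of that recursive CTE technique, an emulated loose index scan that fetches the latest row per item without an item table (it assumes an index on (item_id, event_time) to be fast; written for Postgres 9.3+, which has LATERAL):

```sql
WITH RECURSIVE cte AS (
   (  -- parentheses required: anchor with its own ORDER BY / LIMIT
   SELECT *
   FROM   movement_history
   ORDER  BY item_id, event_time DESC
   LIMIT  1                              -- latest row of the first item
   )
   UNION ALL
   SELECT m.*
   FROM   cte c
   ,      LATERAL (
      SELECT mh.*
      FROM   movement_history mh
      WHERE  mh.item_id > c.item_id      -- skip to the next item
      ORDER  BY mh.item_id, mh.event_time DESC
      LIMIT  1                           -- its latest row
      ) m
   )
SELECT * FROM cte;
```

Each recursion step does one cheap index probe per distinct item, instead of scanning every row the way DISTINCT ON does.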
Init mv_last_movement
once:
CREATE TABLE mv_last_movement AS
SELECT m.*
FROM item i
, LATERAL (
SELECT *
FROM movement_history_current -- current partition
WHERE item_id = i.item_id -- lateral reference
ORDER BY event_time DESC
LIMIT 1
) m;
ALTER TABLE mv_last_movement ADD PRIMARY KEY (item_id);
Then, to refresh (in a single transaction!):
BEGIN;
CREATE TABLE mv_last_movement2 AS
SELECT m.*
FROM item i
, LATERAL (
( -- parentheses required
SELECT *
FROM movement_history_current -- current partition
WHERE item_id = i.item_id -- lateral reference
ORDER BY event_time DESC
LIMIT 1 -- applied to this SELECT, not strictly needed but cheaper
)
UNION ALL -- if not found, fall back to latest previous state
SELECT *
FROM mv_last_movement -- your materialized view
WHERE item_id = i.item_id -- lateral reference
LIMIT 1 -- applied to whole UNION query
) m;
DROP TABLE mv_last_movement;
ALTER TABLE mv_last_movement2 RENAME TO mv_last_movement;
ALTER TABLE mv_last_movement ADD PRIMARY KEY (item_id);
COMMIT;
Or similar. More details here:
The very same LATERAL query from above also replaces your original query cited at the top.
This way you don't have to inspect the whole history for items without current rows, which would be extremely expensive.
Why UNION ALL ... LIMIT 1?
More advice
- varchar for PK / FK columns is inefficient, especially for big tables with 1 million rows per hour. Use integer keys instead.
- Always use ISO 8601 format for date and timestamp literals, or your queries depend on locale settings: '2015-01-15' instead of '1-15-2015'.
- Add NOT NULL constraints wherever a column can't be NULL.
- Optimize your table layout to avoid space lost to padding.
- Configuring PostgreSQL for read performance (the question's title says "read performance", but that part applies to write performance all the same).
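Putting those points together, a reworked table might look like this (the integer surrogate keys for item, location and area are my assumption; an item_code column preserves the old varchar identifier):

```sql
CREATE TABLE item (
    item_id   serial PRIMARY KEY,
    item_code varchar(8) NOT NULL UNIQUE   -- the old 'H665AYG3'-style code
);

CREATE TABLE movement_history (
    event_time  timestamp NOT NULL,        -- 8-byte column first,
    item_id     integer   NOT NULL REFERENCES item,
    location_id integer   NOT NULL,        -- then the 4-byte integers:
    area_id     integer   NOT NULL,        -- no alignment padding between columns
    PRIMARY KEY (item_id, event_time)
);
```

Integer keys shrink both the heap rows and every index entry, which matters most for the (item_id, event_time) index that each of the 1 million hourly inserts must maintain.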
Often software design is a compromise between competing requirements. It is important to understand the relative merits, both for the system as a whole and each case locally. For example, you say writes outnumber reads. That would suggest the system as a whole should be optimised for writes. However, what are those reads for - do they prevent a vehicle collision or cardiac arrest? Perhaps those systems should be optimised for read.
Do you have an index on the time column? Then a query like select * from ... where event_time < ? order by event_time desc limit 1 should use that index. Essentially, you pre-sort the data for all queries.
The irony being that every write will then have to maintain this index, doubling the cost each time.