Performance of rare SELECT vs. frequent INSERT in timeseries data
29-09-2020
Question
I have a simple timeseries table:

CREATE TABLE movement_history (
    data_id     serial,
    item_id     character varying(8),
    event_time  timestamp without time zone,
    location_id character varying(7),
    area_id     character varying(2)
);
My frontend developer tells me the cost of finding where an item was at a given timestamp is too high, because he has to sort the table. He wants me to add a second timestamp column holding the time of the next event so that he doesn't have to sort. But that would more than double the cost of inserting a new movement: I would have to query for the item's previous entry, update it, and then insert the new row.
My inserts far outnumber his queries in frequency. And I have never seen a timeseries table that included a column for the time of the next event. He says my table is broken because his infrequent query requires a sort. Any suggestions?
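For illustration, the write path that the suggested extra column would force on every insert might look roughly like this (the next_event_time column and the literal values are hypothetical):

```sql
BEGIN;
-- 1. Close out the item's previous "open" row: the extra cost.
UPDATE movement_history
SET    next_event_time = '2015-01-15 10:00'
WHERE  item_id = 'H665AYG3'
AND    next_event_time IS NULL;

-- 2. The insert that used to be the whole job.
INSERT INTO movement_history (item_id, event_time, location_id, area_id)
VALUES ('H665AYG3', '2015-01-15 10:00', 'LOC0001', 'A1');
COMMIT;
```

The UPDATE has to locate the previous row first, so every movement now costs a lookup, a (dead-tuple-producing) update, and an insert.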
I don't know what query he is using but I would be doing this:
select * from movement_history
where event_time <= '1-15-2015'::timestamp
and item_id = 'H665AYG3'
order by event_time desc limit 1;
We currently have about 15K items, each entered into the database at most once a day. However, we will soon have 50K items with sensor data that updates every 1 to 5 minutes.
I don't expect his query to run very often, but another query to get the current status of the pallets will:
select distinct on (item_id) *
from movement_history
order by item_id, event_time desc;
This server is currently running 9.3 but it could be running on 9.4 if it needs to.
Solution
Create an index on (item_id, event_time).
The query will jump to the specified item_id, jump to the specified event_time within that item_id, and then move back one row. No sorting involved.
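A minimal sketch of that index and the lookup it serves (the index name is my choice):

```sql
CREATE INDEX movement_history_item_time_idx
    ON movement_history (item_id, event_time);

-- Answered by a single backward index scan, no sort step:
SELECT *
FROM   movement_history
WHERE  item_id = 'H665AYG3'
AND    event_time <= '2015-01-15'::timestamp
ORDER  BY event_time DESC
LIMIT  1;
```

Because the ORDER BY matches the index order for the given item_id, the planner can satisfy the LIMIT 1 by reading one index entry backward from the upper bound.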
OTHER TIPS
Conflicting solutions
You would need a multicolumn index like @jjanes provided. While you're at it, you could make (item_id, event_time) the primary key to provide that index automatically.
But this conflicts with write performance, as @Michael explained: you double the cost for "50K of items ... updated every 1 to 5 minutes" just to make occasional SELECT queries cheaper. That's about 1 million rows per hour.
Partitioning
If you don't have more conflicting requirements, the compromise could be partitioning where the current partition does not have an index, yet. This way you get top write performance and (almost) top read performance.
The parent table could be movement_history, the current partition movement_history_current. No indexes on the current partition, only one CHECK constraint to allow constraint exclusion. Partitions could be daily by default, but the time intervals can be anything; they don't even have to be regular. We can work with that and start a new partition whenever we need to.
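A sketch of that layout using the inheritance-based partitioning available in Postgres 9.3 (the table names follow the text; the constraint bound is an example):

```sql
-- Parent table: never holds rows itself.
CREATE TABLE movement_history (
    data_id     serial,
    item_id     varchar(8),
    event_time  timestamp,
    location_id varchar(7),
    area_id     varchar(2)
);

-- Current partition: no index, just one CHECK constraint so the
-- planner can skip it via constraint exclusion where possible.
CREATE TABLE movement_history_current (
    CHECK (event_time >= '2015-01-15')   -- open-ended upper bound
) INHERITS (movement_history);

-- Writers target the current partition directly, so inserts pay
-- no index maintenance at all:
INSERT INTO movement_history_current (item_id, event_time, location_id, area_id)
VALUES ('H665AYG3', '2015-01-15 10:00', 'LOC0001', 'A1');
```

Queries against the parent movement_history see all partitions; constraint_exclusion (on by default for partitioned setups) prunes partitions whose CHECK constraint contradicts the WHERE clause.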
When you need to include current data in said query, do the following:
1. To start a new partition, in one transaction:
- Rename the current partition by appending something to the name, like movement_history_20150110_20150115 (or more specific), and adjust its constraint on event_time.
- Create a new partition under the ever-same name movement_history_current, with a constraint on event_time that does not overlap the previous one and has an open end.
- Depending on your access patterns, you may have to deal with concurrent write access.
2. Add a PK on (item_id, event_time) to the new historic partition. Not in the same transaction: creating the index in one piece is much cheaper than adding to it incrementally.
2a. To integrate advice for your second query below: refresh the materialized view mv_last_movement.
3. Run your query. Actually, you can run the query any time; if it includes the current partition, or any partition that doesn't have its index yet, it is just slower for those partitions.
4. Archive the oldest partitions from time to time. Just back up and drop the table. That hardly interferes with ongoing operation, which is the beauty of partitioning.
Read the manual first. There are caveats for inheritance and partitioning.
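The rotation steps above, sketched for one changeover (the boundary timestamps and constraint names are examples):

```sql
BEGIN;
-- 1. Rename the current partition and close its time range.
ALTER TABLE movement_history_current
      RENAME TO movement_history_20150110_20150115;
ALTER TABLE movement_history_20150110_20150115
      ADD CONSTRAINT mh_20150110_20150115_upper
      CHECK (event_time < '2015-01-15');   -- now bounded on both ends

-- 2. Recreate the current partition under the same name, open-ended, no index.
CREATE TABLE movement_history_current (
    CHECK (event_time >= '2015-01-15')
) INHERITS (movement_history);
COMMIT;

-- Separately, NOT in the same transaction: index the frozen partition
-- in one piece, which is far cheaper than incremental maintenance.
ALTER TABLE movement_history_20150110_20150115
      ADD PRIMARY KEY (item_id, event_time);
```

Writers that target movement_history_current by name pick up the new table as soon as the transaction commits.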
Your second query
The second query you added in an edit is the far bigger issue for performance. I am talking orders of magnitude:
select distinct on (item_id) * from movement_history order by item_id, event_time desc;
Once you start inserting 1 million rows per hour, performance for this query will deteriorate quickly. You are dealing with many, many rows per item, and DISTINCT ON is only good for few rows per item. Detailed explanation of DISTINCT ON and faster alternatives:
I still suggest partitioning like in my first answer. But enforce a new partition in reasonable intervals, so the current partition does not get too big.
In addition, create a "materialized view" tracking the latest state for each item. It's not a standard MATERIALIZED VIEW, because the defining query has a self-reference. I name it mv_last_movement; it has the same row type as movement_history.
Refresh whenever a new partition starts (see above).
Assuming the existence of an item
table:
CREATE TABLE item (
item_id varchar(8) PRIMARY KEY -- should really be a serial
-- more columns?
);
If you don't have one, create it. Or use the alternative recursive CTE technique outlined in the answer linked above.
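A rough sketch of that recursive CTE technique, an emulated loose index scan that fetches the latest row per item without an item table (it assumes an index on (item_id, event_time) to be fast; written for Postgres 9.3+, which has LATERAL):

```sql
WITH RECURSIVE cte AS (
   (  -- parentheses required: anchor with its own ORDER BY / LIMIT
   SELECT *
   FROM   movement_history
   ORDER  BY item_id, event_time DESC
   LIMIT  1                              -- latest row of the first item
   )
   UNION ALL
   SELECT m.*
   FROM   cte c
   ,      LATERAL (
      SELECT mh.*
      FROM   movement_history mh
      WHERE  mh.item_id > c.item_id      -- skip to the next item
      ORDER  BY mh.item_id, mh.event_time DESC
      LIMIT  1                           -- its latest row
      ) m
   )
SELECT * FROM cte;
```

Each recursion step does one cheap index probe per distinct item, instead of scanning every row the way DISTINCT ON does.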
Init mv_last_movement
once:
CREATE TABLE mv_last_movement AS
SELECT m.*
FROM item i
, LATERAL (
SELECT *
FROM movement_history_current -- current partition
WHERE item_id = i.item_id -- lateral reference
ORDER BY event_time DESC
LIMIT 1
) m;
ALTER TABLE mv_last_movement ADD PRIMARY KEY (item_id);
Then, to refresh (in a single transaction!):
BEGIN;
CREATE TABLE mv_last_movement2 AS
SELECT m.*
FROM item i
, LATERAL (
( -- parentheses required
SELECT *
FROM movement_history_current -- current partition
WHERE item_id = i.item_id -- lateral reference
ORDER BY event_time DESC
LIMIT 1 -- applied to this SELECT, not strictly needed but cheaper
)
UNION ALL -- if not found, fall back to latest previous state
SELECT *
FROM mv_last_movement -- your materialized view
WHERE item_id = i.item_id -- lateral reference
LIMIT 1 -- applied to whole UNION query
) m;
DROP TABLE mv_last_movement;
ALTER TABLE mv_last_movement2 RENAME TO mv_last_movement;
ALTER TABLE mv_last_movement ADD PRIMARY KEY (item_id);
COMMIT;
Or similar. More details here:
The very same LATERAL query from above also replaces your original query cited at the top.
This way you don't have to inspect the whole history for items without current rows, which would be extremely expensive.
Why UNION ALL ... LIMIT 1?
More advice
- varchar for PK / FK columns is inefficient, especially for big tables with 1 million rows per hour. Use integer keys instead.
- Always use ISO 8601 format for date and timestamp literals, or your queries depend on locale settings: '2015-01-15' instead of '1-15-2015'.
- Add NOT NULL constraints wherever a column can't be NULL.
- Optimize your table layout to avoid space lost to padding.
- Configuring PostgreSQL for read performance (the question's title says "read performance", but that part applies to write performance all the same).
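Putting those points together, a reworked table might look like this (the integer surrogate keys for item, location and area are my assumption; an item_code column preserves the old varchar identifier):

```sql
CREATE TABLE item (
    item_id   serial PRIMARY KEY,
    item_code varchar(8) NOT NULL UNIQUE   -- the old 'H665AYG3'-style code
);

CREATE TABLE movement_history (
    event_time  timestamp NOT NULL,        -- 8-byte column first,
    item_id     integer   NOT NULL REFERENCES item,
    location_id integer   NOT NULL,        -- then the 4-byte integers:
    area_id     integer   NOT NULL,        -- no alignment padding between columns
    PRIMARY KEY (item_id, event_time)
);
```

Integer keys shrink both the heap rows and every index entry, which matters most for the (item_id, event_time) index that each of the 1 million hourly inserts must maintain.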
Often software design is a compromise between competing requirements. It is important to understand the relative merits, both for the system as a whole and each case locally. For example, you say writes outnumber reads. That would suggest the system as a whole should be optimised for writes. However, what are those reads for - do they prevent a vehicle collision or cardiac arrest? Perhaps those systems should be optimised for read.
Do you have an index on the time column? Then a query like select * from ... where event_time < ? order by event_time desc limit 1 should use that index. Essentially, you pre-sort the data for all queries.
The irony being that every write will then have to maintain this index, doubling the cost each time.