Should I duplicate columns between tables to speed up aggregations like SUM? [closed]
06-03-2021
Question
I have two tables in a PostgreSQL 10.12 database:
waste_card
(id (PK), user_id, manufacture_date, transfer_date, address_id, card_type, .. (and other card-specific fields))
wastes
(user_id, waste_card_id (FK), amount, ... (other wastes-specific fields))
I frequently need to list and sum a user's wastes (2 separate API endpoints), filtered by address and by manufacture_date or transfer_date.
To list items I currently run 2 queries: first load the user's cards, then load the wastes for those cards:
SELECT waste_cards.*
FROM waste_cards
WHERE waste_cards.user_id = $1
AND (waste_cards.manufacture_date < '$2')
AND (waste_cards.address_id = $3)
LIMIT $4
SELECT wastes.*
FROM wastes
WHERE wastes.waste_card_id IN ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11)
To just sum the amount I run:
SELECT SUM(wastes.amount)
FROM waste_cards
LEFT OUTER JOIN wastes
ON wastes.waste_card_id = waste_cards.id
WHERE waste_cards.user_id = $1
AND (waste_cards.manufacture_date < '$2')
AND (waste_cards.address_id = $3)
Should I add the fields [manufacture_date, transfer_date, address_id] to wastes so that the query can be simplified to something like:
SELECT SUM(amount)
FROM wastes
WHERE user_id = $1
AND address_id = $2
AND manufacture_date < '$3'
I'm at the very beginning with this system. Queries take less than 150 ms. Today there are about 1K users with 100-200 cards each and 2-5 wastes on each card. That's not much yet, but we want to add another 10-15K users next month. So I want to ask whether this DB schema is correct or whether I should change it before going live in production.
Solution
DB design
Seems like a 1:n relationship between waste_card and wastes. Actual table definitions (CREATE TABLE statements) would clarify.
Do not repeat (duplicate) the columns user_id, manufacture_date, and address_id in the many-table wastes. That would bloat your table, introduce maintenance problems, raise the question of which copy is the true source of information, blatantly defy normalization, etc.
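Since the actual definitions were not posted, here is a minimal sketch of what a normalized layout might look like. All column types and constraints are assumptions, and only the column names from the question are used:

```sql
-- Hypothetical definitions: data types and NOT NULL constraints are assumptions.
CREATE TABLE waste_card (
    id               bigserial PRIMARY KEY,
    user_id          bigint NOT NULL,
    manufacture_date date,
    transfer_date    date,
    address_id       bigint,
    card_type        text
    -- other card-specific columns
);

CREATE TABLE wastes (
    -- user_id, address_id and the dates live only in waste_card
    waste_card_id bigint  NOT NULL REFERENCES waste_card (id),
    amount        numeric NOT NULL
    -- other waste-specific columns
);
```

With this layout every filter column has exactly one home, and wastes reaches it through the foreign key.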
Queries
To compute a total sum, just use a plain [INNER] JOIN. Waste cards with no related entries in wastes don't change the result and can be excluded:
SELECT SUM(w.amount) AS total_waste
FROM waste_card wc
JOIN wastes w ON w.waste_card_id = wc.id
WHERE wc.user_id = $1
AND wc.manufacture_date < $2
AND wc.address_id = $3;
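One detail worth knowing: SUM returns NULL, not 0, when no row qualifies. If the counting endpoint should report 0 in that case (an assumption about the API, not something stated in the question), a sketch of the same query wrapped in COALESCE:

```sql
-- Same query as above; COALESCE turns the NULL of an empty result into 0.
SELECT COALESCE(SUM(w.amount), 0) AS total_waste
FROM   waste_card wc
JOIN   wastes w ON w.waste_card_id = wc.id
WHERE  wc.user_id = $1
AND    wc.manufacture_date < $2
AND    wc.address_id = $3;
```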
LEFT JOIN makes sense to ...
- list the sum per entry in waste_card
- and include cards with no entries in wastes
But then it's typically better to use a LATERAL subquery computing the sum per card. The aggregate function guarantees exactly one row from the subquery, so we can just as well switch to CROSS JOIN LATERAL (like Andriy pointed out):
SELECT wc.* -- or better just the columns you need
, w.sum_waste
FROM waste_card wc
CROSS JOIN LATERAL (
SELECT SUM(w.amount) AS sum_waste
FROM wastes w
WHERE w.waste_card_id = wc.id
) w
WHERE wc.user_id = $1
AND wc.manufacture_date < $2
AND wc.address_id = $3;
Covering index
To speed up either query, an index "covering" wastes.amount might pay (requires Postgres 11 or later):
CREATE INDEX your_idx_name ON wastes(waste_card_id) INCLUDE (amount);
For Postgres 10 or older, fall back to a multicolumn index:
CREATE INDEX your_idx_name ON wastes(waste_card_id, amount);
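Whether such an index actually helps is best checked against real data. A rough sketch using EXPLAIN, with hypothetical literal values standing in for the query parameters:

```sql
-- The literals 1, '2021-01-01' and 1 are hypothetical stand-ins for $1, $2, $3.
-- Note that ANALYZE actually executes the query.
EXPLAIN (ANALYZE, BUFFERS)
SELECT SUM(w.amount) AS total_waste
FROM   waste_card wc
JOIN   wastes w ON w.waste_card_id = wc.id
WHERE  wc.user_id = 1
AND    wc.manufacture_date < '2021-01-01'
AND    wc.address_id = 1;
```

An "Index Only Scan" node on wastes in the plan would indicate that the index is being used as intended. On small tables the planner may well prefer a different plan, and index-only scans depend on the visibility map, so a recently vacuumed table tends to benefit more.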
Aside: LIMIT without ORDER BY (like in your first query) produces arbitrary results. Typically you want to add ORDER BY to make the result deterministic.
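As a sketch, the first query from the question with a deterministic sort order. The choice of sort columns is an assumption; any order that ends in a unique column such as id will do:

```sql
SELECT wc.*
FROM   waste_card wc
WHERE  wc.user_id = $1
AND    wc.manufacture_date < $2
AND    wc.address_id = $3
ORDER  BY wc.manufacture_date, wc.id  -- assumed sort order; id breaks ties
LIMIT  $4;
```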