Should I duplicate columns between tables to speed up aggregations like SUM? [closed]

https://dba.stackexchange.com/questions/274226

06-03-2021

Question

I have two tables in a PostgreSQL 10.12 database:

waste_card (id (PK), user_id, manufacture_date, transfer_date, address_id, card_type, ... (and other card-specific fields))
wastes (user_id, waste_card_id (FK), amount, ... (other waste-specific fields))

I need to frequently list and count user wastes (2 separate API endpoints) by address and manufacture_date or transfer_date.

To list items, I currently run two queries: first load the user's cards, then load the wastes for those cards:

SELECT waste_cards.* 
FROM   waste_cards 
WHERE  waste_cards.user_id = $1 
       AND (waste_cards.manufacture_date < '$2') 
       AND (waste_cards.address_id = $3) 
LIMIT $4

SELECT wastes.* 
FROM   wastes 
WHERE  wastes.waste_card_id IN ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11)

To just sum the amount, I run:

SELECT SUM(wastes.amount) 
FROM   waste_cards 
LEFT   OUTER JOIN wastes 
       ON wastes.waste_card_id = waste_cards.id 
WHERE  waste_cards.user_id = $1 
       AND (waste_cards.manufacture_date < '$2') 
       AND (waste_cards.address_id = $3)

Should I add the fields [manufacture_date, transfer_date, address_id] to wastes to speed up queries, so that they look like this:

SELECT SUM(amount) 
FROM   wastes 
WHERE  user_id = $1 
       AND address_id = $2 
       AND manufacture_date < '$3'

I'm at the very beginning with this system. Queries take less than 150 ms. Today it's about 1K users with 100-200 cards each and 2-5 wastes on each card. That's not much yet, but we want to add another 10-15K users next month. So I want to ask whether this DB schema is correct or whether I should change it before going live in production.

Solution

DB design

Seems like a 1:n relationship between waste_card and wastes. Actual table definitions (CREATE TABLE statements) would clarify.
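
For illustration, the two tables might look roughly like the sketch below. This is only a guess based on the column lists in the question: every data type and constraint here is an assumption, not something the question states.

```sql
-- Hypothetical definitions; all types and constraints are assumed.
CREATE TABLE waste_card (
   id               bigint PRIMARY KEY
 , user_id          bigint NOT NULL
 , manufacture_date date
 , transfer_date    date
 , address_id       bigint
 , card_type        text
   -- other card-specific columns
);

CREATE TABLE wastes (
   user_id       bigint
 , waste_card_id bigint  NOT NULL REFERENCES waste_card (id)  -- "many" side of the 1:n relationship
 , amount        numeric NOT NULL
   -- other waste-specific columns
);
```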

Do not repeat (duplicate) the columns user_id, manufacture_date, and address_id in wastes, the table on the "many" side of the relationship. That would bloat the table, complicate maintenance, raise the question of which copy is the true source of information, and blatantly defy normalization.

Queries

To compute a total sum, just use a plain [INNER] JOIN. Waste cards with no related entries in wastes don't change the result and can be excluded:

SELECT SUM(w.amount) AS total_waste
FROM   waste_card wc
JOIN   wastes     w  ON w.waste_card_id = wc.id
WHERE  wc.user_id = $1 
AND    wc.manufacture_date < $2
AND    wc.address_id = $3;
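
A small self-contained check of that claim, using made-up sample rows in place of the real tables (the ids, user id, and amounts are invented for the demo). Card 2 has no wastes; the LEFT JOIN adds one row for it with a NULL amount, which SUM ignores:

```sql
WITH waste_card (id, user_id) AS (
   VALUES (1, 10), (2, 10)          -- card 2 has no related wastes
   )
, wastes (waste_card_id, amount) AS (
   VALUES (1, 5), (1, 7)
   )
SELECT (SELECT SUM(w.amount)
        FROM   waste_card wc
        JOIN   wastes w ON w.waste_card_id = wc.id
        WHERE  wc.user_id = 10) AS inner_join_sum   -- 12
     , (SELECT SUM(w.amount)
        FROM   waste_card wc
        LEFT   JOIN wastes w ON w.waste_card_id = wc.id
        WHERE  wc.user_id = 10) AS left_join_sum;   -- 12
```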

LEFT JOIN makes sense to list the sum per entry in waste_card and include cards with no entries in wastes.

But then it's typically better to use a LATERAL subquery computing the sum per card. The aggregate function guarantees exactly one row from the subquery so we can just as well switch to CROSS JOIN LATERAL (like Andriy pointed out):

SELECT wc.*  -- or better just the columns you need
     , w.sum_waste
FROM   waste_card wc
CROSS  JOIN LATERAL (
   SELECT SUM(w.amount) AS sum_waste
   FROM   wastes w
   WHERE  w.waste_card_id = wc.id
   ) w
WHERE  wc.user_id = $1 
AND    wc.manufacture_date < $2
AND    wc.address_id = $3;
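
Note that SUM over zero rows returns NULL, so cards without wastes show NULL in sum_waste. If 0 is preferred, COALESCE can be wrapped around it. A minimal sketch with made-up sample rows standing in for the real tables:

```sql
WITH waste_card (id, user_id) AS (
   VALUES (1, 10), (2, 10)          -- card 2 has no related wastes
   )
, wastes (waste_card_id, amount) AS (
   VALUES (1, 5), (1, 7)
   )
SELECT wc.id
     , COALESCE(w.sum_waste, 0) AS sum_waste   -- NULL (no wastes) becomes 0
FROM   waste_card wc
CROSS  JOIN LATERAL (
   SELECT SUM(w.amount) AS sum_waste
   FROM   wastes w
   WHERE  w.waste_card_id = wc.id
   ) w
WHERE  wc.user_id = 10
ORDER  BY wc.id;
-- id 1: sum_waste 12
-- id 2: sum_waste 0
```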

Covering index

To speed up either query, an index "covering" wastes.amount might pay off (requires Postgres 11 or later):

CREATE INDEX your_idx_name ON wastes(waste_card_id) INCLUDE (amount);

For Postgres 10 or older, fall back to a multicolumn index:

CREATE INDEX your_idx_name ON wastes(waste_card_id, amount);
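
Whether either index actually helps can be checked with EXPLAIN. A rough sketch, assuming the wastes table and one of the indexes above exist; the card id 1 is a placeholder:

```sql
-- Index-only scans rely on the visibility map, which VACUUM maintains.
VACUUM ANALYZE wastes;

EXPLAIN (ANALYZE, BUFFERS)
SELECT SUM(w.amount)
FROM   wastes w
WHERE  w.waste_card_id = 1;  -- placeholder card id
```

If the plan shows an "Index Only Scan" node with a low "Heap Fetches" count, the index is covering the query. On a small table the planner may still prefer a different plan.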

Aside: LIMIT without ORDER BY (like in your first query) produces arbitrary results. Typically, you want to add ORDER BY to get deterministic results.
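
For example, the card listing could be ordered like this. The choice of sort columns is an assumption, so pick whatever order the API should return; adding the primary key as a tiebreaker makes the order fully deterministic:

```sql
SELECT wc.*
FROM   waste_card wc
WHERE  wc.user_id = $1
AND    wc.manufacture_date < $2
AND    wc.address_id = $3
ORDER  BY wc.manufacture_date DESC, wc.id  -- assumed sort order; id breaks ties
LIMIT  $4;
```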

Licensed under: CC-BY-SA with attribution
