Question

I have a star schema database in Redshift. I'm running an aggregation on a fact table order_facts with the following relevant columns:

total - FLOAT - total order cost

payment_date - INTEGER - payment date in the format YYYYMMDD

shop_id - INTEGER foreign key to the shops dimension

I'm using the following dimension tables:

DATES

id - INTEGER - date in format YYYYMMDD

date - DATE - SQL date formatted field

SHOPS

id - INTEGER - primary key for the table

created - DATETIME - timestamp for when the shop was created

My goal is to obtain a table of the total amount of orders for each shop in its first 30 days.

My query is the following, and it should give an accurate answer. However, it takes over half an hour to run:

SELECT of.shop_id, SUM(of.total) FROM order_facts of
INNER JOIN shops s
ON of.shop_id = s.id
INNER JOIN dates d
ON of.payment_date = d.id
WHERE d.date <= (s.created + INTERVAL '30 days')
GROUP BY of.shop_id

I tried rewriting it like this, but the query still had not finished after more than 20 minutes of runtime:

SELECT SUM(r.total), r.shop_id
FROM (
    SELECT s.created, of.shop_id, of.total, of.payment_date
    FROM order_facts of
    INNER JOIN shops s
    ON of.shop_id = s.id
) r
INNER JOIN dates d
ON r.payment_date = d.id
WHERE d.date <= (r.created + INTERVAL '30 days')
GROUP BY r.shop_id

I don't currently understand why it's taking so long to run. Understanding that would help me figure out how to correct the query. Likewise, seeing a better version of the query above would help me understand where the inefficiency lies. Either way would be extremely helpful.

Just getting the total order amount per shop for the past X days, or all-time, is a very fast query. Thus it seems I'm doing something suboptimal with the join to the dates table, but it's not clear what.
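For comparison, the fast all-time version is just a grouped scan of the fact table. A minimal sketch using the column names above (the alias and output name are illustrative):

SELECT f.shop_id, SUM(f.total) AS all_time_total
FROM order_facts f
GROUP BY f.shop_id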

EDIT: output of EXPLAIN

XN HashAggregate  (cost=868301988.49..868301991.01 rows=1009 width=12)
  ->  XN Merge Join DS_DIST_NONE  (cost=0.00..868286973.07 rows=3003084 width=12)
        Merge Cond: ("outer".id = "inner".payment_date)
        Join Filter: ("inner".shop_id = "outer".id)
        ->  XN Nested Loop DS_BCAST_INNER  (cost=0.00..1773635510.00 rows=32472000 width=8)
              Join Filter: (("outer".date)::timestamp without time zone > ("inner".created + '30 days'::interval))
              ->  XN Seq Scan on dates d  (cost=0.00..110.00 rows=11000 width=8)
              ->  XN Seq Scan on shops s  (cost=0.00..88.56 rows=8856 width=12)
        ->  XN Seq Scan on order_facts "of"  (cost=0.00..90092.52 rows=9009252 width=16)
----- Nested Loop Join in the query plan - review the join predicates to avoid Cartesian products -----

Solution

I made a subquery that built daily order totals per shop, then ran the aggregation over that result. Execution time was ~6 seconds. The final query looked like this:

SELECT s.shop_name, rrr.first_thirty_day_total FROM (
    SELECT SUM(rr.daily_total) AS first_thirty_day_total, rr.shop_id FROM (
        SELECT d.date, r.daily_total, r.shop_id FROM (
            SELECT SUM(of.total) AS daily_total, of.shop_id, of.payment_date
            FROM order_facts of
            GROUP BY shop_id, payment_date
        ) r
        INNER JOIN dates d
        ON r.payment_date = d.id
    ) rr
    INNER JOIN shops s
    ON s.id = rr.shop_id
    WHERE rr.date <= s.created + INTERVAL '30 days'
    GROUP BY rr.shop_id
) rrr
INNER JOIN shops s
ON rrr.shop_id = s.id
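The key change is that the fact table is collapsed to one row per shop per day before any join or inequality filtering happens, so the date comparison only runs against the small pre-aggregated set instead of every order row. For reference, a minimal sketch of the same idea without the second join to shops (assuming shop_name exists on shops, as used above; this is a restatement, not the original query):

SELECT s.shop_name, SUM(r.daily_total) AS first_thirty_day_total
FROM (
    SELECT shop_id, payment_date, SUM(total) AS daily_total
    FROM order_facts
    GROUP BY shop_id, payment_date
) r
INNER JOIN dates d ON r.payment_date = d.id
INNER JOIN shops s ON s.id = r.shop_id
WHERE d.date <= s.created + INTERVAL '30 days'
GROUP BY s.id, s.shop_name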