Generate multiple running totals with GROUP BY day

https://dba.stackexchange.com/questions/265956

28-02-2021
|

Question

I have a set of transactions of stock purchases by users and I want to keep track of a running balance of each stock as the year progresses. I am using a windowing function to track the running balance but for some reason I cannot get the GROUP BY portion of this query to work.

It continues to have duplicate days in the result set even when I attempt to group by the date (created_at). Sample below:

select
  t.customer_id,
  t.created_at::date,
  sum(case when t.stock_ticker = 'tsla' then t.amount end) over (order by t.created_at::date rows unbounded preceding) as tsla_running_amount,
  sum(case when t.stock_ticker = 'goog' then t.amount end) over (order by t.created_at::date rows unbounded preceding) as goog_running_amount,
from transactions t
group by t.created_at, t.customer_id, t.stock_ticker, t.amount
order by t.created_at desc;

Test setup:

CREATE TABLE transactions (
   transaction_id varchar(255) NOT NULL,
   amount float8 NOT NULL,
   stock_ticker varchar(255) NOT NULL,
   transaction_type varchar(255) NOT NULL,
   customer_id varchar NOT NULL,
   inserted_at timestamp NOT NULL,
   created_at timestamp NOT NULL,
   CONSTRAINT transactions_pkey PRIMARY KEY (transaction_id)
);

INSERT INTO transactions(transaction_id, amount, stock_ticker, transaction_type, customer_id, inserted_at, created_at)
VALUES
  ('123123abmk12', 10, 'tsla', 'purchase', 'a1b2c3', '2020-04-01 01:00:00', '2020-04-01 01:00:00')
, ('123123abmk13', 20, 'tsla', 'purchase', 'a1b2c3', '2020-04-03 01:00:00', '2020-04-03 01:00:00')
, ('123123abmk14',  5, 'goog', 'purchase', 'a1b2c3', '2020-04-01 01:00:00', '2020-04-01 01:00:00')
, ('123123abmk15',  8, 'goog', 'purchase', 'a1b2c3', '2020-04-03 01:00:00', '2020-04-03 01:00:00');

CREATE INDEX ix_transactions_customer_id ON transactions USING btree (customer_id);

The result here always comes back with multiple rows per day, when I want them to be grouped all into one day.

After doing some research I attempted to cast created_at to date in the GROUP BY clause as well, but I get this error:

Column t.created_at must appear in the GROUP BY clause or be used in an aggregate function

In addition, the results are only going to show days in which a transaction has happened for a user. I need to be able to show a row for each day in a time series (1 year) even if the user did not make a transaction that day. (Using the most recent running balance on the row instead.)

I think that generate_series() is the way to go, but I am having trouble understanding how to fit it in.

Solution

I see a couple of issues. This should do it:

SELECT * -- ⑥
FROM   (  -- ①
   SELECT the_day::date
   FROM   generate_series(timestamp '2020-01-01', date_trunc('day', localtimestamp), interval '1 day') the_day
   ) d 
LEFT   JOIN ( -- ②
   SELECT customer_id
        , created_at::date AS the_day -- ⑥
        , sum(sum(t.amount) FILTER (WHERE stock_ticker = 'tsla')) OVER w AS tsla_running_amount -- ③
        , sum(sum(t.amount) FILTER (WHERE stock_ticker = 'goog')) OVER w AS goog_running_amount
   FROM   transactions t
   WHERE  created_at >= timestamp '2020-01-01'  -- ④
   GROUP  BY customer_id, created_at::date  -- ⑤
   WINDOW w AS (PARTITION BY customer_id ORDER BY created_at::date) -- ③
   ) t USING (the_day) -- ⑥
ORDER  BY customer_id, the_day; -- ⑦

db<>fiddle here

① Generating all days of the current year in optimized fashion. See:

Generating time series between two dates in PostgreSQL

② The LEFT JOIN you had in mind. You may really want one row per combination of (customer_id, day). See:

Fill missing dates within groups

③ While you also GROUP BY customer_id the Window frame needs to lead with PARTITION BY customer_id.

Drop rows unbounded preceding. The default window frame should be just fine. The manual:

The default framing option is RANGE UNBOUNDED PRECEDING, which is the same as RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.

Use the aggregate FILTER clause instead of the CASE expression. Shorter, faster, cleaner. See:

Return counts for multiple ranges in a single SELECT statement

Finally wrap the result of the aggregation in another sum() for the window function. See:

Get the distinct sum of a joined table column

The separate WINDOW clause avoids spelling out the same repeatedly. No effect on performance.

④ The WHERE is logically redundant, but assuming there are older rows, too, it excludes irrelevant rows early, improving performance.
Asuming no or almost no future timestamps, so no upper bound.

⑤ You cannot have t.stock_ticker, t.amount in GROUP BY as you want one row per (customer_id, the_day).

⑥ Use the same column alias (the_day in my example) to allow the simple USING (the_day) in the JOIN clause, and SELECT * at the top level.

⑦ Not sure about the final sort order. This seems more useful to me. Adjust to your preference.

Licensed under: CC-BY-SA with attribution

Not affiliated with dba.stackexchange