Question

I have a table with the following data (PayPal transactions):

   txn_type    |        date         | subscription_id
---------------+---------------------+-----------------
 subscr_signup | 2014-01-01 07:53:20 | S-XXX01
 subscr_signup | 2014-01-05 10:37:26 | S-XXX02
 subscr_signup | 2014-01-08 08:54:00 | S-XXX03
 subscr_eot    | 2014-03-01 08:53:57 | S-XXX01
 subscr_eot    | 2014-03-05 08:58:02 | S-XXX02

I want to get the overall average subscription length for a given time period (subscr_eot marks the end of a subscription). A subscription that is still ongoing ('S-XXX03') should be included in the average, counted from its start date until now.
How would I go about doing this with an SQL statement in Postgres?

Solution

Subscription length for each subscription:

select
    subscription_id,
    coalesce(t2.date, current_timestamp) - t1.date as subscription_length
from
    (
        select *
        from t
        where txn_type = 'subscr_signup'
    ) t1
    left join
    (
        select *
        from t
        where txn_type = 'subscr_eot'
    ) t2 using (subscription_id)
order by t1.subscription_id

The average:

select
    avg(coalesce(t2.date, current_timestamp) - t1.date) as subscription_length_avg
from
    (
        select *
        from t
        where txn_type = 'subscr_signup'
    ) t1
    left join
    (
        select *
        from t
        where txn_type = 'subscr_eot'
    ) t2 using (subscription_id)
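
To try the two queries above locally, a minimal setup could look like the sketch below. The column types are an assumption: the question shows only the data, not the table definition, and the table name t is simply the one the queries use.

```sql
-- Assumed schema; the question does not give the table definition.
CREATE TABLE t (
    txn_type        text      NOT NULL,
    date            timestamp NOT NULL,
    subscription_id text      NOT NULL
);

-- Sample rows copied from the question.
INSERT INTO t (txn_type, date, subscription_id) VALUES
    ('subscr_signup', '2014-01-01 07:53:20', 'S-XXX01'),
    ('subscr_signup', '2014-01-05 10:37:26', 'S-XXX02'),
    ('subscr_signup', '2014-01-08 08:54:00', 'S-XXX03'),
    ('subscr_eot',    '2014-03-01 08:53:57', 'S-XXX01'),
    ('subscr_eot',    '2014-03-05 08:58:02', 'S-XXX02');
```

With this data, S-XXX03 has no subscr_eot row, so the left join leaves t2.date null and coalesce falls back to the current time.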

OTHER TIPS

I used a couple of common table expressions; you can take the pieces apart pretty easily to see what they do.

One reason this SQL is complicated is that you're storing column names as data. (subscr_signup and subscr_eot are actually column names, not data.) This is a SQL anti-pattern; expect it to cause you much pain.

with subscription_dates as (
  select 
      p1.subscription_id, 
      p1.date as subscr_start,
      coalesce((select min(p2.date) 
                from paypal_transactions p2
                where p2.subscription_id = p1.subscription_id
                  and p2.txn_type = 'subscr_eot'
                  and p2.date > p1.date), current_date) as subscr_end
  from paypal_transactions p1
  where txn_type = 'subscr_signup'
), subscription_days as (
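  -- Assumes "date" has type date, so the subtraction yields an integer day count;
  -- with a timestamp column, cast both operands with ::date first.
  select subscription_id, subscr_start, subscr_end, (subscr_end - subscr_start) + 1 as subscr_days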
  from subscription_dates 
)
select avg(subscr_days) as avg_days
from subscription_days
-- add your date range here.

avg_days
--
75.6666666666666667

I didn't add your date range as a WHERE clause, because it's not clear to me what you mean by "a given time period".
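
One possible reading of "a given time period" is "subscriptions whose signup falls inside the period". Under that assumption, a sketch of the filter could look like this. The bounds are placeholders rather than values from the question, and the result is returned as an interval instead of a day count.

```sql
-- Assumption: the period selects subscriptions by signup date.
-- The two bounds are placeholders; replace them with the real period.
with subscription_dates as (
  select
      p1.subscription_id,
      p1.date as subscr_start,
      coalesce((select min(p2.date)
                from paypal_transactions p2
                where p2.subscription_id = p1.subscription_id
                  and p2.txn_type = 'subscr_eot'
                  and p2.date > p1.date), localtimestamp) as subscr_end
  from paypal_transactions p1
  where p1.txn_type = 'subscr_signup'
    and p1.date >= timestamp '2014-01-01'   -- inclusive lower bound
    and p1.date <  timestamp '2014-02-01'   -- exclusive upper bound
)
select avg(subscr_end - subscr_start) as avg_length
from subscription_dates;
```

A half-open range (>= lower, < upper) avoids counting a signup at exactly midnight in two adjacent periods.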

Using the window function lag(), this becomes considerably shorter:

SELECT avg(ts_end - ts) AS avg_subscr
FROM  (
   SELECT txn_type, ts, lag(ts, 1, localtimestamp)
                OVER (PARTITION BY subscription_id ORDER BY txn_type) AS ts_end
   FROM  t
   ) sub
WHERE txn_type = 'subscr_signup';

lag() conveniently takes a default value for missing rows, which is exactly what we need here, so no additional COALESCE is required.

The query builds on the fact that subscr_eot sorts before subscr_signup.
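
If relying on the alphabetical order of the two type names feels fragile, the ordering can be spelled out. The following is only a sketch of that variation, not part of the original query. It assumes at most one subscr_signup and one subscr_eot row per subscription, and it filters out any other transaction types the table might hold.

```sql
SELECT avg(ts_end - ts) AS avg_subscr
FROM  (
   SELECT txn_type, ts,
          lag(ts, 1, localtimestamp) OVER (
             PARTITION BY subscription_id
             -- boolean false sorts before true, so the 'subscr_eot' row
             -- (if any) comes directly before the 'subscr_signup' row
             ORDER BY (txn_type = 'subscr_signup')
          ) AS ts_end
   FROM   t
   -- keep other transaction types from slipping in between the two rows
   WHERE  txn_type IN ('subscr_signup', 'subscr_eot')
   ) sub
WHERE  txn_type = 'subscr_signup';
```

For a subscription without a subscr_eot row, the signup row is the first in its partition, so lag() returns the default localtimestamp.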

This is probably faster than the alternatives presented so far because it needs only a single sequential scan, even though the window function adds some cost.

I use the column name ts instead of date for three reasons:

  • Your "date" is actually a timestamp.
  • "date" is a reserved word in standard SQL (even if it's allowed in Postgres).
  • Never use basic type names as identifiers.

I use localtimestamp instead of now() or current_timestamp because you are evidently working with timestamp [without time zone] values, while now() and current_timestamp return timestamp with time zone.

Also, your columns txn_type and subscription_id should not be text. Consider an enum for txn_type and an integer for subscription_id. That would make the table and its indexes considerably smaller and faster.
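
A rough sketch of such a layout follows. The names txn_type_enum and t_compact are hypothetical, and mapping PayPal's 'S-...' identifiers to integers is assumed to happen elsewhere.

```sql
-- Hypothetical type name. Enum values sort in the order they are listed,
-- so 'subscr_eot' is listed first to keep ORDER BY txn_type working
-- the same way as in the lag() query.
CREATE TYPE txn_type_enum AS ENUM ('subscr_eot', 'subscr_signup');

-- Hypothetical table name; column names follow the answer.
CREATE TABLE t_compact (
    subscription_id integer       NOT NULL,  -- assumed surrogate key
    txn_type        txn_type_enum NOT NULL,
    ts              timestamp     NOT NULL
);
```

Note that the declaration order matters here: listing 'subscr_signup' first would reverse the sort order that the lag() query depends on.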

For the query at hand, the whole table has to be read and indexes won't help, except for a covering index in Postgres 9.2+, if you need the read performance:

CREATE INDEX t_foo_idx ON t (subscription_id, txn_type, ts);
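
Whether the planner actually picks an index-only scan can be checked with EXPLAIN. This is a sketch assuming the table t and column ts from the query above; the planner may still prefer a sequential scan, especially on a small table.

```sql
-- VACUUM updates the visibility map, which index-only scans depend on.
VACUUM ANALYZE t;

-- Look for "Index Only Scan using t_foo_idx" in the resulting plan.
EXPLAIN (ANALYZE, BUFFERS)
SELECT avg(ts_end - ts) AS avg_subscr
FROM  (
   SELECT txn_type, ts, lag(ts, 1, localtimestamp)
                OVER (PARTITION BY subscription_id ORDER BY txn_type) AS ts_end
   FROM  t
   ) sub
WHERE txn_type = 'subscr_signup';
```
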
Licensed under: CC-BY-SA with attribution