Question

I query the YouTube Data API for a list of the most popular videos on a channel and then fetch their statistics four times per hour (every 15 minutes, via cron). The data is stored in Postgres, but dumping it and loading it into another SQL database wouldn't be a problem. I now have the following table of data:

 video_id| views_count | likes_count | timestamp 
---------+-------------+-------------+---------------------
     foo | 100         | 1           | 2018-12-01 12:01:03
     foo | 101         | 1           | 2018-12-01 12:16:06
     foo | 105         | 1           | 2018-12-01 12:31:01
     bar | 199         | 0           | 2018-12-01 12:01:02
     bar | 200         | 0           | 2018-12-01 12:16:08
     bar | 301         | 5           | 2018-12-01 12:31:02
     ... | ...

Update: here's the schema (pasted into SQL Fiddle):

CREATE TABLE video_statistics
(
  video_id TEXT not null,
  views_count INTEGER not null,
  likes_count INTEGER not null,
  timestamp TIMESTAMPTZ not null
);

How should I query this data to get the hourly increments in the views_count and likes_count columns, grouped by video? To clarify, this is the result I want:

hour_of_day|video_id|views_increment|likes_increment
-----------+--------+---------------+---------------
     ...   | ...
     11    | foo    | 4             | 0
     12    | foo    | 5             | 1
     ...   | ...
     11    | bar    | 73            | 0
     12    | bar    | 102           | 5
     ...   | ...

In other words, I want a "best time to post a video" report based on historical data spanning many weeks and months. Should I instead dump the data into a time-series database, or some other database better suited to this kind of work, and query it there? Or should I just calculate this in application code?

Solution

One possibility is to use row_number() to pick the first and the last record per video, day and hour, then join the two sets to get the difference within each hour. Finally, group the result by video and hour of day and take the sum or the average across days. In the query below, elbat is a placeholder for your table name.

SELECT first.video_id,
       first.timestamp_hour,
       sum(last.views_count - first.views_count) views_count_diff_sum,
       sum(last.likes_count - first.likes_count) likes_count_diff_sum,
       avg(last.views_count - first.views_count) views_count_diff_avg,
       avg(last.likes_count - first.likes_count) likes_count_diff_avg
       FROM (SELECT video_id,
             timestamp_day,
             timestamp_hour,
             views_count,
             likes_count
             FROM (SELECT video_id,
                          timestamp::date timestamp_day,
                          date_part('hour', timestamp) timestamp_hour,
                          views_count,
                          likes_count,
                          row_number() OVER (PARTITION BY video_id,
                                                          timestamp::date,
                                                          date_part('hour', timestamp)
                                             ORDER BY timestamp ASC) rn
                          FROM elbat) first
             WHERE rn = 1) first
            INNER JOIN (SELECT video_id,
                               timestamp_day,
                               timestamp_hour,
                               views_count,
                               likes_count
                               FROM (SELECT video_id,
                                            timestamp::date timestamp_day,
                                            date_part('hour', timestamp) timestamp_hour,
                                            views_count,
                                            likes_count,
                                            row_number() OVER (PARTITION BY video_id,
                                                                            timestamp::date,
                                                                            date_part('hour', timestamp)
                                                               ORDER BY timestamp DESC) rn
                                            FROM elbat) last
                               WHERE rn = 1) last
                       ON last.video_id = first.video_id
                          AND last.timestamp_day = first.timestamp_day
                          AND last.timestamp_hour = first.timestamp_hour
       GROUP BY first.video_id,
                first.timestamp_hour;
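
The first-to-last difference above only covers growth between samples that fall inside the same hour, so the increase between the last sample of one hour and the first sample of the next is not counted. A possible variation, given here only as a sketch, is to compute the increase between consecutive samples with lag() and then bucket those increases by hour. It assumes the question's video_statistics table; the CTE names deltas and hourly and the output column names are made up for illustration, and the day and hour boundaries follow the session time zone because the column is TIMESTAMPTZ.

```sql
-- Sketch only: assumes the video_statistics table from the question.
WITH deltas AS (
    SELECT video_id,
           "timestamp" AS sampled_at,
           -- increase since the previous sample of the same video
           views_count - lag(views_count) OVER w AS views_delta,
           likes_count - lag(likes_count) OVER w AS likes_delta
    FROM video_statistics
    WINDOW w AS (PARTITION BY video_id ORDER BY "timestamp")
),
hourly AS (
    -- total increase per video, calendar day and hour
    SELECT video_id,
           sampled_at::date              AS sample_day,
           date_part('hour', sampled_at) AS hour_of_day,
           sum(views_delta)              AS views_increment,
           sum(likes_delta)              AS likes_increment
    FROM deltas
    WHERE views_delta IS NOT NULL  -- drops each video's first sample
    GROUP BY video_id, sampled_at::date, date_part('hour', sampled_at)
)
-- average the per-day increments for each video and hour of day
SELECT hour_of_day,
       video_id,
       avg(views_increment) AS views_increment_avg,
       avg(likes_increment) AS likes_increment_avg
FROM hourly
GROUP BY hour_of_day, video_id
ORDER BY video_id, hour_of_day;
```

Each increase is attributed to the hour of the later sample, so with 15-minute polling a delta that straddles the top of the hour lands entirely in the new hour. Whether that approximation is acceptable depends on how precise the report needs to be.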

OTHER TIPS

Schema:

create table T 
( video_id char(3) not null
, views_count int not null
, likes_count int not null
, ts timestamp not null
);

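As a guess, something like the following. The counters are cumulative, so the inner query takes the highest value seen in each calendar hour rather than a sum, and the outer query subtracts the previous hour's value from the current one:

select hr, video_id
     , vc - lag(vc) over (partition by video_id
                          order by hr) as vc_incr
     , lc - lag(lc) over (partition by video_id
                          order by hr) as lc_incr
from (
    -- date_trunc keeps the date, so hours from different days stay separate
    select date_trunc('hour', ts) as hr
         , video_id
         , max(views_count) as vc
         , max(likes_count) as lc
    from t
    group by date_trunc('hour', ts)
           , video_id
 ) as tt;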

Note that you will have to decide what to do with rows that do not have a lag row, i.e. the first row in each partition.
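
To illustrate the two most obvious choices for that first row, here is a hedged sketch against the T table above. It relies on the three-argument form of lag(value, offset, default); the output column names are made up for this example.

```sql
-- Sketch only: uses the T table defined in this answer.
select video_id
     , ts
       -- Option 1: keep NULL for the first sample and filter it out later
     , views_count - lag(views_count) over w as views_incr_or_null
       -- Option 2: default the "previous" value to the current one,
       -- so the first sample reports an increment of 0
     , views_count - lag(views_count, 1, views_count) over w as views_incr_or_zero
from t
window w as (partition by video_id order by ts);
```

Keeping the NULL is usually the safer choice when the increments are averaged afterwards, because avg() ignores NULLs whereas an artificial 0 would pull the average down.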

Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange