Question

I was wondering if someone could help me with some SQL for returning the number of unique users logged in a database table during a period of two or more days (let's use 7 days as a reference).

My log table contains a timestamp (ts) and user_id in each row, representing activity from that user at that time.

The following query returns the Daily Active Users or DAU from this log:

SELECT FLOOR(ts / 86400) AS day, COUNT(DISTINCT user_id) AS dau
FROM log
GROUP BY day ORDER BY day ASC

Now let's say I would like to add to this single query (or at least retrieve in the most efficient possible fashion) the Weekly Active Users, or total unique users logged for a period of 7 days. However, I don't want to divide my time in non-overlapping weeks. What I need is to count, for each day, the distinct user_ids seen during that day and the 6 previous days.

For example:

day users wau
1   1,2   2
4   1,3   3
7   3,4,5 5
8   5     4    (user_id 2 lost from count)
15  2     1    (user_ids 1,3,4,5 lost from count)

Thank you for any help you can provide and feel free to ask via comment if you need further clarification.


Solution

To get a "Weekly Active User" count (per my understanding of your specification... "for each day, the count of distinct user_ids seen during that day and the previous six days"), a query along the lines of the one below could be used. (The query also returns the "Daily Active User" count.)

SELECT d.day
     , COUNT(DISTINCT u.user_id) AS wau
     , COUNT(DISTINCT IF(u.day=d.day,u.user_id,NULL)) AS dau
  FROM ( SELECT FLOOR(k.ts/86400) AS `day`
           FROM `log` k
          GROUP BY `day`
       ) d
  JOIN ( SELECT FLOOR(l.ts/86400) AS `day`
              , l.user_id
           FROM `log` l
          GROUP BY `day`, l.user_id
       ) u
    ON u.day <= d.day
   AND u.day > d.day - 7
 GROUP BY d.day
 ORDER BY d.day

(I have not yet run a test of this; but I will later, and I will update this statement if any corrections are needed.)

This query joins the list of users for a given day (the u rowsource) to the set of days in the log table (the d rowsource). Note the literal "7" that appears in the join predicate (the ON clause): that's what gets each day from d "matched" to the user lists for that day and the previous 6 days.
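To see what the join produces on the question's sample data, here is a minimal sketch that runs the same join shape in Python against an in-memory SQLite database. SQLite is only a stand-in for MySQL here, so CASE replaces IF(), integer division replaces FLOOR(ts/86400), and SELECT DISTINCT replaces the GROUP BY in the inline views. The names SAMPLE, WAU_SQL and weekly_active_users are made up for the illustration.

```python
import sqlite3

# (day, user_id) pairs from the question's example; each day is turned into
# a timestamp of day * 86400 so that ts / 86400 recovers the day number.
SAMPLE = [(1, 1), (1, 2), (4, 1), (4, 3), (7, 3), (7, 4), (7, 5), (8, 5), (15, 2)]

# SQLite rendering of the join (assumption: SQLite stands in for MySQL).
# CASE replaces MySQL's IF(); integer division replaces FLOOR(ts / 86400).
WAU_SQL = """
SELECT d.day,
       COUNT(DISTINCT u.user_id) AS wau,
       COUNT(DISTINCT CASE WHEN u.day = d.day THEN u.user_id END) AS dau
  FROM (SELECT DISTINCT ts / 86400 AS day FROM log) d
  JOIN (SELECT DISTINCT ts / 86400 AS day, user_id FROM log) u
    ON u.day <= d.day
   AND u.day > d.day - 7
 GROUP BY d.day
 ORDER BY d.day
"""


def weekly_active_users(rows):
    """Return (day, wau, dau) tuples for an iterable of (day, user_id) pairs."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE log (ts INTEGER, user_id INTEGER)")
    conn.executemany(
        "INSERT INTO log (ts, user_id) VALUES (?, ?)",
        [(day * 86400, user_id) for day, user_id in rows],
    )
    result = conn.execute(WAU_SQL).fetchall()
    conn.close()
    return result


for row in weekly_active_users(SAMPLE):
    print(row)
```

With the predicate u.day > d.day - 7, the window for day 15 is days 9 through 15, so only user 2 is counted on that row.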

Note that this could also be extended to get the distinct user count over the past 3 days, for example, by adding another expression in the SELECT list.

     , COUNT(DISTINCT IF(u.day<=d.day AND u.day>d.day-3,u.user_id,NULL)) AS 3day

That literal "7" could be increased to get a larger range. And the literal 3 in the expression above could be changed to any number of days up to that range... we just need to be sure we've got enough previous day rows (from u) joined to each row from d.
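As a rough check of the extra 3-day column, the sketch below adds it to the same join and runs it in Python on an in-memory SQLite database. This is an assumption-laden stand-in for MySQL (CASE instead of IF(), integer division instead of FLOOR), and the function name wau_and_three_day and the alias three_day are invented for the example.

```python
import sqlite3


def wau_and_three_day(rows):
    """Return (day, wau, three_day) tuples for (day, user_id) pairs."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE log (ts INTEGER, user_id INTEGER)")
    conn.executemany(
        "INSERT INTO log (ts, user_id) VALUES (?, ?)",
        [(day * 86400, user_id) for day, user_id in rows],
    )
    result = conn.execute(
        """
        SELECT d.day,
               COUNT(DISTINCT u.user_id) AS wau,
               -- The join already guarantees u.day <= d.day, so only the
               -- lower bound of the shorter window needs checking.
               COUNT(DISTINCT CASE WHEN u.day > d.day - 3
                                   THEN u.user_id END) AS three_day
          FROM (SELECT DISTINCT ts / 86400 AS day FROM log) d
          JOIN (SELECT DISTINCT ts / 86400 AS day, user_id FROM log) u
            ON u.day <= d.day
           AND u.day > d.day - 7
         GROUP BY d.day
         ORDER BY d.day
        """
    ).fetchall()
    conn.close()
    return result


sample = [(1, 1), (1, 2), (4, 1), (4, 3), (7, 3), (7, 4), (7, 5), (8, 5), (15, 2)]
for row in wau_and_three_day(sample):
    print(row)
```

The shorter window only works because it fits inside the 7-day join range; a window longer than 7 would silently be truncated by the ON clause.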

PERFORMANCE NOTE: Due to the inline views (or derived tables, as MySQL calls them), this query may not be very fast, since the resultsets for those inline views have to be materialized into intermediate MyISAM tables.

The inline view aliased as u may not be optimal; it might be faster to join directly to the log table. I was thinking in terms of getting a unique list of users for a given day, which is what the query in the inline view gives me. It was just easier for me to conceptualize what was going on. And I was thinking that if you had hundreds of rows for the same user on a given day, the inline view would weed out a whole bunch of the duplicates before we did the join to the other days. A WHERE clause to limit the number of days we are returning would be best added inside the u and d inline views. (The u inline view would need to include an extra earlier 6 days, so that the first day returned still sees its full window.)
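A minimal sketch of that WHERE clause placement, again in Python on an in-memory SQLite database as a stand-in for MySQL. The function name wau_for_range and the parameters first_day and last_day are hypothetical. The filter is written against the raw ts column, on the assumption that this keeps it usable with an index on ts.

```python
import sqlite3


def wau_for_range(rows, first_day, last_day):
    """Return (day, wau) for first_day..last_day from (day, user_id) pairs."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE log (ts INTEGER, user_id INTEGER)")
    conn.executemany(
        "INSERT INTO log (ts, user_id) VALUES (?, ?)",
        [(day * 86400, user_id) for day, user_id in rows],
    )
    result = conn.execute(
        """
        SELECT d.day, COUNT(DISTINCT u.user_id) AS wau
          FROM (SELECT DISTINCT ts / 86400 AS day
                  FROM log
                 -- d holds only the days being reported
                 WHERE ts >= :first * 86400
                   AND ts <  (:last + 1) * 86400) d
          JOIN (SELECT DISTINCT ts / 86400 AS day, user_id
                  FROM log
                 -- u reaches back 6 extra days to fill the first window
                 WHERE ts >= (:first - 6) * 86400
                   AND ts <  (:last + 1) * 86400) u
            ON u.day <= d.day
           AND u.day > d.day - 7
         GROUP BY d.day
         ORDER BY d.day
        """,
        {"first": first_day, "last": last_day},
    ).fetchall()
    conn.close()
    return result


sample = [(1, 1), (1, 2), (4, 1), (4, 3), (7, 3), (7, 4), (7, 5), (8, 5), (15, 2)]
print(wau_for_range(sample, 7, 8))
```

If u were restricted to the same range as d, day 7 would only see its own three users instead of the five seen over days 1 through 7.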


On another note, if the ts column is a TIMESTAMP datatype, I would be more inclined to use a DATE(ts) expression to extract the date portion. But that would return a DATE datatype in the resultset, rather than an integer, which would be different from the resultset you specified.

SELECT d.day
     , COUNT(DISTINCT u.user_id) AS wau
     , COUNT(DISTINCT IF(u.day=d.day,u.user_id,NULL)) AS dau
  FROM ( SELECT DATE(k.ts) AS `day`
           FROM `log` k
          GROUP BY `day`
       ) d
  JOIN ( SELECT DATE(l.ts) AS `day`
              , l.user_id
           FROM `log` l
          GROUP BY `day`, l.user_id
       ) u
    ON u.day <= d.day
   AND u.day > DATE_ADD(d.day, INTERVAL -7 DAY)
 GROUP BY d.day
 ORDER BY d.day

OTHER TIPS

Here is another great example of why one should use date, datetime or timestamp field types to represent time values in the database rather than unix timestamps. Invariably, someone wants to actually query against the field and then you are left having to do a bunch of timestamp conversions, since integer timestamp values have no inherent concept of periods of time and you need to query based on periods of time. In the process, you lose any ability to utilize indexes on the fields.

At any rate, that is a pretty complex query you are looking to do. There might be a better way than what I am suggesting, but hopefully what I am suggesting at least makes sense. In this approach, you would perform a Cartesian join by joining the table to itself. You then limit the number of records by using an ON condition to make sure the dates in the second log table are within the seven day period of the date in the first log table. Finally, you do your aggregation and grouping. The query might look like this:

SELECT DATE(FROM_UNIXTIME(log1.ts)) as `day`, COUNT(DISTINCT log2.user_id) as `wau`
FROM log AS log1
INNER JOIN log AS log2
ON DATE(FROM_UNIXTIME(log2.ts)) <= DATE(FROM_UNIXTIME(log1.ts))
-- both bounds are inclusive, so 6 days back gives a 7-day window
AND DATE(FROM_UNIXTIME(log2.ts)) >= DATE_SUB(DATE(FROM_UNIXTIME(log1.ts)), INTERVAL 6 DAY)
GROUP BY `day`
ORDER BY `day` ASC

A warning though. If you have any decently significant number of log entries, this query will take a long time to run, as you are going to be multiplying the number of records in the result set by some factor and you won't be using indexes.

Your best bet might be to actually create a new date format column in the table and run an update to populate the value. Make sure you have an index on that field. Then your query could look like this:

SELECT log1.date_field as `day`, COUNT(DISTINCT log2.user_id) as `wau`
FROM log AS log1
INNER JOIN log AS log2
ON log2.date_field <= log1.date_field
-- both bounds are inclusive, so 6 days back gives a 7-day window
AND log2.date_field >= DATE_SUB(log1.date_field, INTERVAL 6 DAY)
GROUP BY `day`
ORDER BY `day` ASC

You could then populate this field on all log entries going forward.
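The add-a-column approach can be sketched end to end as follows, using Python with an in-memory SQLite database as a stand-in for MySQL. The function name wau_by_date and the index name idx_log_date_field are invented for the example, and the SQLite calls date(ts, 'unixepoch') and date(x, '-6 days') play the roles that DATE(FROM_UNIXTIME(ts)) and DATE_SUB(x, INTERVAL 6 DAY) would play in MySQL.

```python
import sqlite3


def wau_by_date(rows):
    """Return (date, wau) tuples for an iterable of (day, user_id) pairs."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE log (ts INTEGER, user_id INTEGER)")
    conn.executemany(
        "INSERT INTO log (ts, user_id) VALUES (?, ?)",
        [(day * 86400, user_id) for day, user_id in rows],
    )
    # One-off migration: add the date column, backfill it, and index it.
    conn.execute("ALTER TABLE log ADD COLUMN date_field TEXT")
    conn.execute("UPDATE log SET date_field = date(ts, 'unixepoch')")
    conn.execute("CREATE INDEX idx_log_date_field ON log (date_field)")
    result = conn.execute(
        """
        SELECT log1.date_field AS day,
               COUNT(DISTINCT log2.user_id) AS wau
          FROM log AS log1
          JOIN log AS log2
            ON log2.date_field <= log1.date_field
           AND log2.date_field >= date(log1.date_field, '-6 days')
         GROUP BY log1.date_field
         ORDER BY log1.date_field
        """
    ).fetchall()
    conn.close()
    return result


sample = [(1, 1), (1, 2), (4, 1), (4, 3), (7, 3), (7, 4), (7, 5), (8, 5), (15, 2)]
for row in wau_by_date(sample):
    print(row)
```

SQLite stores the dates as ISO-8601 text, which compares correctly as strings; in MySQL the column would be a real DATE type instead.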

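This is a simple and straightforward way to get the users who were active on every day of a calendar week. Note that it uses fixed calendar weeks rather than a sliding window, and it assumes ts is a DATETIME or TIMESTAMP column (wrap it in FROM_UNIXTIME() if it holds Unix timestamps). The count has to be over distinct days, since a user may have many log rows on the same day:

select yearweek(ts) as yearwk, user_id, count(distinct date(ts)) as active_days from log group by 1,2 having count(distinct date(ts)) = 7;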

Licensed under: CC-BY-SA with attribution