Question

I have a table which contains analytical information, i.e.: Page Views on each page.

field            type
---------------------------------------------
page_id          long
created_time     long (epoch UTC - rounded by hour)
page_views       long

I round the epoch down to the hour (e.g. 1398456553 ==> 1398456000), so this table holds aggregated information per hour.
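For reference, here is a minimal sketch of that hourly rollup. It assumes a hypothetical raw `page_events` table with one row per view and an `event_time` epoch column; those names are illustrative, not part of the actual schema.

INSERT INTO `page_stats` (`page_id`, `created_time`, `page_views`)
SELECT
    `page_id`,
    (`event_time` DIV 3600) * 3600 AS created_time, -- truncate epoch seconds to the hour
    COUNT(*) AS page_views
FROM `page_events`
GROUP BY `page_id`, created_time;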

When a client requests their data, we can make the proper adjustments so they see the data in their local timezone.

If the client's local timezone is UTC, the query is simple:

SELECT
    FROM_UNIXTIME(st.`created_time`, '%Y-%m-%d') AS created_at,
    SUM(st.`page_views`) AS page_views
FROM `page_stats` st
WHERE st.`created_time` 
    BETWEEN 1396310400 -- 01 Apr 2014 00:00:00 GMT
    AND 1397088000 -- 10 Apr 2014 00:00:00 GMT
GROUP BY created_at;

If the client's timezone is something else (e.g. -03:00), the query requires a bit more manipulation to adjust the dates to the correct TZ:

SELECT
    DATE_FORMAT(CONVERT_TZ(FROM_UNIXTIME(st.`created_time`), '+00:00', '-03:00'), '%Y-%m-%d') AS created_at,
    SUM(st.`page_views`) AS page_views
FROM `page_stats` st
WHERE st.`created_time` 
    BETWEEN 1396321200 -- 01 Apr 2014 03:00:00 GMT
    AND 1397098800 -- 10 Apr 2014 03:00:00 GMT
GROUP BY created_at;

This approach works just fine for small periods (< 30 days), but it scales poorly when the date range spans several months, both because of the number of rows that must be selected and because of the per-row transformation done by functions like DATE_FORMAT.

The ideal data granularity is DAY, but I can't create a table aggregated by day because the daily rollup differs for each TZ.

What should be the proper way to model tables to provide TZ fidelity on large datasets?

It's worth noting that I can tolerate some error (< 2%) in this GROUP BY; maybe some probabilistic data structure could help solve the problem, but I haven't figured one out yet.


Solution

First, note that TimeZone != Offset. See the timezone tag wiki.

Second, if you are aggregating by the target date in multiple time zones, you may want to just pick a handful of relevant time zones and precompute their local dates into dedicated columns in your data. Then it is easy to aggregate at query time. Of course, this strategy doesn't hold up if you want to support all 500+ time zones in the IANA tzdb.
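A rough sketch of that approach, assuming the relevant zones are UTC and America/Sao_Paulo; the extra column names are made up, and CONVERT_TZ with named zones requires the MySQL time zone tables to be loaded:

ALTER TABLE `page_stats`
    ADD COLUMN `created_date_utc` DATE,
    ADD COLUMN `created_date_sp` DATE; -- America/Sao_Paulo

-- Backfill; assumes the session time zone is UTC, so FROM_UNIXTIME yields UTC datetimes
UPDATE `page_stats`
SET `created_date_utc` = DATE(FROM_UNIXTIME(`created_time`)),
    `created_date_sp`  = DATE(CONVERT_TZ(FROM_UNIXTIME(`created_time`), 'UTC', 'America/Sao_Paulo'));

-- Daily aggregation then becomes a plain GROUP BY on a precomputed column
SELECT
    st.`created_date_sp` AS created_at,
    SUM(st.`page_views`) AS page_views
FROM `page_stats` st
WHERE st.`created_date_sp` BETWEEN '2014-04-01' AND '2014-04-09'
GROUP BY created_at;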

Another strategy would be to build another set of tables that pre-aggregate items into 15-minute buckets. Why 15 minutes? Because not all time zone offsets are whole hours. Consider -4:30 used in Venezuela, +5:30 used by India, +5:45 used in Nepal, and +8:45 used in parts of Australia. Once you have these pre-aggregates, you can transform them to the specific client's timezone at query time.
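A sketch of what such a pre-aggregate could look like, again assuming the hypothetical raw `page_events` table from above (the existing hourly rows cannot be split into finer buckets) and a UTC session time zone; the -03:00 client is just an example:

CREATE TABLE `page_stats_15m` (
    `page_id`     BIGINT NOT NULL,
    `bucket_time` BIGINT NOT NULL, -- epoch seconds, truncated to a 15-minute boundary
    `page_views`  BIGINT NOT NULL,
    PRIMARY KEY (`page_id`, `bucket_time`)
);

INSERT INTO `page_stats_15m` (`page_id`, `bucket_time`, `page_views`)
SELECT
    `page_id`,
    (`event_time` DIV 900) * 900 AS bucket_time, -- 900 s = 15 minutes
    COUNT(*) AS page_views
FROM `page_events`
GROUP BY `page_id`, bucket_time;

-- At query time, shift whole buckets by the client's fixed offset (-03:00 = -10800 s)
-- and group by the resulting local day:
SELECT
    FROM_UNIXTIME(`bucket_time` - 10800, '%Y-%m-%d') AS created_at,
    SUM(`page_views`) AS page_views
FROM `page_stats_15m`
WHERE `bucket_time`
    BETWEEN 1396321200 -- 01 Apr 2014 03:00:00 GMT
    AND 1397098800 -- 10 Apr 2014 03:00:00 GMT
GROUP BY created_at;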

And finally, you might consider that a relational database like MySQL may simply not be the best tool for this particular job. An OLAP cube would work quite well, and so might a map/reduce job in any of several NoSQL databases. You may want to replicate your data from MySQL to a separate "reporting store" or "data warehouse" and query from there.
