Question

I have a big MySQL table containing daily metrics for a large number of subjects. Here is the hypothetical schema:

day DATE
subject_id INT
metric1
metric2
metric3

What I want is to find the top X subjects (by a particular metric) for an arbitrary date range. Something like this:

SELECT subject_id, SUM(metric1) 
FROM t1 
WHERE day BETWEEN '2018-05-01' AND '2018-05-15'
GROUP BY subject_id
ORDER BY SUM(metric1) DESC
LIMIT 10

Given that the table contains 10M subjects and daily metrics for the past 365 days, it contains roughly 3.6B rows. No matter how I index or partition it, there will still be scenarios that make the query run for a long time (e.g. a user selecting the past 365 days). The goal is to have queries complete in a few seconds so that they can be used to power dashboards in real time.

I tried to make this work with Amazon Aurora (MySQL), but haven't managed to optimize it to run nearly as fast as required.

It seems like the best options are BigQuery and Athena. Still, I was wondering whether there are alternatives that are particularly tailored to this specific use case.

Are values updated? If so, how often?

That is a good question. We have many datasets, and for the vast majority of them the data is append-only. However, a few larger datasets are updated 2-3 times over the course of the first 60 days. Only 5% of the data is actually modified, while 95% remains the same as on the day of insert.


Solution

Amazon Athena would be a good choice for this application. However, queries might not complete in a few seconds, so the solution might be to use Athena to generate aggregates and then load the results into a relational database to power your dashboard. AWS Glue can help with this data pipeline.

To generate aggregates in Athena:

You would create an S3 bucket: s3://somebucket/

Then you would create key prefixes, formatted as follows, to serve as your partitions:

s3://somebucket/date_partition=YYYY-MM-DD/
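With that layout in place, a table definition along the following lines could expose the data to Athena. This is only a sketch: the table name daily_metrics, the BIGINT metric types, and the Parquet storage format are assumptions, since the question gives no column types or file format. Because the day is encoded in the prefix, it is declared as a partition column rather than a regular one.

```sql
-- Assumed names and types: daily_metrics, BIGINT metrics, Parquet files.
CREATE EXTERNAL TABLE daily_metrics (
  subject_id INT,
  metric1    BIGINT,
  metric2    BIGINT,
  metric3    BIGINT
)
PARTITIONED BY (date_partition STRING)  -- matches the date_partition=YYYY-MM-DD prefixes
STORED AS PARQUET
LOCATION 's3://somebucket/';

-- Register the Hive-style partitions that already exist under the bucket.
MSCK REPAIR TABLE daily_metrics;

-- The original query, filtered on the partition column so that only
-- the prefixes for the requested days are scanned.
SELECT subject_id, SUM(metric1) AS metric1_sum
FROM daily_metrics
WHERE date_partition BETWEEN '2018-05-01' AND '2018-05-15'
GROUP BY subject_id
ORDER BY metric1_sum DESC
LIMIT 10;
```

Filtering on date_partition lets Athena skip every prefix outside the range, but a 365-day range still reads all of the data, which is why the answer treats Athena as the aggregation step rather than the dashboard backend.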

The goal is to have queries complete in a few seconds so that they can be used to power dashboards in real time.

With this requirement, I would likely pre-calculate aggregates in Athena for the ranges relevant to data scientists and business users (trailing year, trailing month, etc.) and then write them to a relational database. On AWS, Glue is useful for this sort of data pipelining. If historical data are not updated frequently (or at all), this can be run as a daily batch process (as @Michael Kutz suggests).
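As a rough illustration of the relational side, the pre-calculated aggregates could land in a table like the one below, written in MySQL syntax. Everything here is hypothetical: the table name subject_rollup, the period labels such as 'trailing_365d', and the BIGINT type are assumptions made for the sketch, and only metric1 is shown.

```sql
-- Hypothetical rollup table, refreshed by the daily batch job.
CREATE TABLE subject_rollup (
  period_name VARCHAR(16) NOT NULL,  -- assumed labels, e.g. 'trailing_30d', 'trailing_365d'
  as_of_day   DATE        NOT NULL,  -- last day covered by the batch run
  subject_id  INT         NOT NULL,
  metric1_sum BIGINT      NOT NULL,  -- BIGINT is an assumption; the source gives no types
  PRIMARY KEY (period_name, as_of_day, subject_id),
  KEY idx_top_metric1 (period_name, as_of_day, metric1_sum)
);

-- Dashboard query: reads precomputed sums instead of scanning daily rows.
SELECT subject_id, metric1_sum
FROM subject_rollup
WHERE period_name = 'trailing_365d'
  AND as_of_day = '2018-05-15'
ORDER BY metric1_sum DESC
LIMIT 10;
```

The secondary index lets the database read the ten largest sums for one period directly from the index. The trade-off is that only the predefined periods are fast; a truly arbitrary date range would still have to go back to Athena. If the dashboard only ever shows a top list, the batch job could store just the top N subjects per period to keep the table small.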

See:

AWS Documentation » Amazon Athena » User Guide » Working with Source Data » Partitioning Data

1.1 Billion Taxi Rides on Amazon Athena

Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange