Question

I have a large MySQL (5.7) table with millions of rows (it contains data for each second). Calculations have to be performed on these values, in some cases over large ranges of data. Therefore, I want to do a preprocessing step in which I perform the calculations beforehand and store the results in a separate table. For internal reasons the results are aggregated, so that, for example, I need the average value of the formula over one hour. Here is an example:

CREATE TABLE `datavalues` (
    `id` BIGINT(20) NOT NULL AUTO_INCREMENT,
    `ddate` DATE NOT NULL,
    `ttime` TIME NOT NULL,
    `unixtime` BIGINT(20) NOT NULL,
    `value1` DOUBLE,
    `value2` DOUBLE,
    PRIMARY KEY (`id`),
    INDEX `unixtime_idx` (`unixtime`) 
);
CREATE TABLE `calculation_result` (
    `id` BIGINT(20) NOT NULL AUTO_INCREMENT,
    `ddate` DATE NOT NULL,
    `ttime` TIME NOT NULL,
    `unixtime` BIGINT(20) NOT NULL,
    `result1` DOUBLE,
    PRIMARY KEY (`id`),
    INDEX `unixtime_idx` (`unixtime`) 
);

As an example calculation I compute value1 - value2. The initial query for filling the calculation table with one value per hour could look like this:

INSERT INTO calculation_result (ddate, ttime, unixtime, result1)
SELECT MAX(ddate),
       STR_TO_DATE(TIME_FORMAT(MAX(ttime), '%H:00:00'), '%H:%i:%s'),
       UNIX_TIMESTAMP(ADDDATE(MAX(ddate), INTERVAL HOUR(MAX(ttime)) HOUR)),
       AVG(value1 - value2)
FROM datavalues
GROUP BY ddate, HOUR(ttime);

So far so good. However, I am struggling to write a query that updates the result1 column of the calculation_result table when the calculation definition changes, for example from value1 - value2 to value1 + value2. How can I update the result1 column efficiently? Deleting and recreating the calculation table is not an option, as it contains many further calculations in other columns.
Additionally, I expect the update process to take some time. As new data is imported on a regular basis, how can I prevent the calculation_result table from being locked? Maybe by updating the data in chunks?
Thanks for your help.

Solution

The UPDATE can be done by joining against the SELECT you already have; you just need to add aliases for the columns.

The logic is simple: you can join the two tables on ddate and ttime, which is always a full hour (for example 01:00:00), because you group by date and hour.

UPDATE calculation_result cr
        INNER JOIN
    (SELECT 
        MAX(ddate) ddate,
            STR_TO_DATE(TIME_FORMAT(MAX(ttime), '%H:00:00'), '%H:%i:%s') AS ttime,
            UNIX_TIMESTAMP(ADDDATE(MAX(ddate), INTERVAL HOUR(MAX(ttime)) HOUR)) unixtime,
            -- swap this expression (e.g. AVG(value1 + value2)) when the calculation definition changes
            AVG(value1 - value2) avgresult
    FROM
        datavalues
    GROUP BY ddate , HOUR(ttime)) t1 ON cr.ddate = t1.ddate
        AND cr.ttime = t1.ttime
SET 
    cr.result1 = t1.avgresult, cr.unixtime = t1.unixtime;

Other tips

If your event table is huge, what takes a while is reading it. Once the data is read, computing a few extra aggregates won't hurt a lot. And with one row per second in the event table and one row per hour in the aggregate table, you'll reduce data size substantially anyway. So you should basically store everything you'll need.

For example, if you need avg(a-b) and avg(a+b), then since avg(a-b) = avg(a) - avg(b) and avg(a+b) = avg(a) + avg(b), the simplest option is to just store avg(a) and avg(b); you then get the sum, the difference, and any other linear combination for free when selecting from the aggregate table. Of course, that won't work for aggregates that can't be combined this way. For example, max(a-b) cannot be calculated from max(a) and max(b).
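For instance, here is a minimal sketch of that idea, assuming two hypothetical columns avg_value1 and avg_value2 added to calculation_result (they are not part of the original schema):

ALTER TABLE calculation_result
    ADD COLUMN avg_value1 DOUBLE,
    ADD COLUMN avg_value2 DOUBLE;

-- Store the per-hour component averages once
UPDATE calculation_result cr
        INNER JOIN
    (SELECT
        ddate,
            HOUR(ttime) AS hh,
            AVG(value1) AS avg_value1,
            AVG(value2) AS avg_value2
    FROM
        datavalues
    GROUP BY ddate , HOUR(ttime)) t1 ON cr.ddate = t1.ddate
        AND HOUR(cr.ttime) = t1.hh
SET
    cr.avg_value1 = t1.avg_value1, cr.avg_value2 = t1.avg_value2;

-- Any linear combination is then available without reading datavalues again
SELECT ddate, ttime,
       avg_value1 - avg_value2 AS avg_diff,
       avg_value1 + avg_value2 AS avg_sum
FROM calculation_result;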

As new data is imported on a regular basis, how can I prevent the calculation_result table from being locked?

Use an engine that supports row-level locking, like InnoDB, for your aggregate table. If you don't want a long-running update on your aggregation table, you can write the results of the aggregation to a temporary table first, which takes no locks on calculation_result while the query runs, and then update the main table quickly.
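A minimal sketch of that approach, assuming the same hourly grouping as above and a hypothetical temporary table tmp_result:

-- Heavy aggregation runs into a temporary table; calculation_result stays untouched
CREATE TEMPORARY TABLE tmp_result AS
SELECT ddate,
       HOUR(ttime) AS hh,
       AVG(value1 + value2) AS avgresult
FROM datavalues
GROUP BY ddate, HOUR(ttime);

-- The update of the main table is then short and holds row locks only briefly
UPDATE calculation_result cr
        INNER JOIN tmp_result t1
        ON cr.ddate = t1.ddate AND HOUR(cr.ttime) = t1.hh
SET cr.result1 = t1.avgresult;

DROP TEMPORARY TABLE tmp_result;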

If you have performance problems with your amount of data, you could consider ClickHouse or another time-series database. ClickHouse is easy to use if you know SQL, but it is not a general-purpose database. If you use it for what it is good at, performance is just ridiculous, on the order of 120 million rows/s for SELECT x, avg(y) GROUP BY x. It also has an AggregatingMergeTree table engine that automatically stores aggregates over larger time windows, which is basically what you are asking for. And since it compresses the tables, the data files are tiny. It does use large amounts of RAM, though.
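As a rough illustration only (the table layout and names below are made up, not taken from the question), the hourly aggregation in ClickHouse could look like this:

-- Raw per-second data in a MergeTree table
CREATE TABLE datavalues_ch (
    ts      DateTime,
    value1  Float64,
    value2  Float64
) ENGINE = MergeTree
ORDER BY ts;

-- Hourly average computed on the fly over the raw data
SELECT toStartOfHour(ts) AS hour,
       avg(value1 - value2) AS result1
FROM datavalues_ch
GROUP BY hour
ORDER BY hour;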

Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange