SQL Select Statement For Calculating A Running Average Column

https://stackoverflow.com/questions/911326

06-09-2019
|

Question

I am trying to have a running average column in the SELECT statement based on a column from the n previous rows in the same SELECT statement. The average I need is based on the n previous rows in the resultset.

Let me explain

Id        Number       Average
 1             1          NULL
 2             3          NULL
 3             2          NULL
 4             4             2 <----- Average of (1, 3, 2),Numbers from previous 3 rows
 5             6             3 <----- Average of (3, 2, 4),Numbers from previous 3 rows
 .             .             .
 .             .             .

The first 3 rows of the Average column are null because there are no previous rows. The row 4 in the Average column shows the average of the Number column from the previous 3 rows.

I need some help trying to construct a SQL Select statement that will do this.

Solution

This should do it:

--Test Data
CREATE TABLE    RowsToAverage
    (
    ID int NOT NULL,
    Number int NOT NULL
    )

INSERT  RowsToAverage(ID, Number)
SELECT  1, 1
UNION ALL
SELECT  2, 3
UNION ALL
SELECT  3, 2
UNION ALL
SELECT  4, 4
UNION ALL
SELECT  5, 6
UNION ALL
SELECT  6, 8
UNION ALL
SELECT  7, 10

--The query
;WITH   NumberedRows
AS
(
SELECT  rta.*, row_number() OVER (ORDER BY rta.ID ASC) AS RowNumber
FROM    RowsToAverage rta
)

SELECT  nr.ID, nr.Number,
        CASE
            WHEN nr.RowNumber <=3 THEN NULL
            ELSE (  SELECT  avg(Number) 
                    FROM    NumberedRows 
                    WHERE   RowNumber < nr.RowNumber
                    AND     RowNumber >= nr.RowNumber - 3
                )
        END AS MovingAverage
FROM    NumberedRows nr

OTHER TIPS

Assuming that the Id column is sequential, here's a simplified query for a table named "MyTable":

SELECT 
    b.Id,
    b.Number,
    (
      SELECT 
       AVG(a.Number) 
      FROM 
       MyTable a 
     WHERE 
       a.id >= (b.Id - 3) 
       AND a.id < b.Id
       AND b.Id > 3 
     ) as Average
FROM 
    MyTable b;

A simple self join would seem to perform much better than a row referencing subquery

Generate 10k rows of test data:

drop table test10k
create table test10k (Id int, Number int, constraint test10k_cpk primary key clustered (id))

;WITH digits AS (
    SELECT 0 as Number
    UNION SELECT 1
    UNION SELECT 2
    UNION SELECT 3
    UNION SELECT 4
    UNION SELECT 5
    UNION SELECT 6
    UNION SELECT 7
    UNION SELECT 8
    UNION SELECT 9
)
,numbers as (
    SELECT 
        (thousands.Number * 1000) 
        + (hundreds.Number * 100) 
        + (tens.Number * 10) 
        + ones.Number AS Number
    FROM digits AS ones 
    CROSS JOIN digits AS tens
    CROSS JOIN digits AS hundreds
    CROSS JOIN digits AS thousands
)
insert test10k (Id, Number)
select Number, Number
from numbers

I would pull the special case of the first 3 rows out of the main query, you can UNION ALL those back in if you really want it in the row set. Self join query:

;WITH   NumberedRows
AS
(
    SELECT  rta.*, row_number() OVER (ORDER BY rta.ID ASC) AS RowNumber
    FROM    test10k rta
)

SELECT  nr.ID, nr.Number,
    avg(trailing.Number) as MovingAverage
FROM    NumberedRows nr
    join NumberedRows as trailing on trailing.RowNumber between nr.RowNumber-3 and nr.RowNumber-1
where nr.Number > 3
group by nr.id, nr.Number

On my machine this takes about 10 seconds, the subquery approach that Aaron Alton demonstrated takes about 45 seconds (after I changed it to reflect my test source table) :

;WITH   NumberedRows
AS
(
    SELECT  rta.*, row_number() OVER (ORDER BY rta.ID ASC) AS RowNumber
    FROM    test10k rta
)
SELECT  nr.ID, nr.Number,
    CASE
            WHEN nr.RowNumber <=3 THEN NULL
            ELSE (  SELECT  avg(Number) 
                            FROM    NumberedRows 
                            WHERE   RowNumber < nr.RowNumber
                            AND             RowNumber >= nr.RowNumber - 3
                    )
    END AS MovingAverage
FROM    NumberedRows nr

If you do a SET STATISTICS PROFILE ON, you can see the self join has 10k executes on the table spool. The subquery has 10k executes on the filter, aggregate, and other steps.

Edit: I missed the point that it should average the three previous records...

For a general running average, I think something like this would work:

SELECT
    id, number, 
    SUM(number) OVER (ORDER BY ID) / 
       ROW_NUMBER() OVER (ORDER BY ID) AS [RunningAverage]
FROM myTable
ORDER BY ID

Check out some solutions here. I'm sure that you could adapt one of them easily enough.

If you want this to be truly performant, and arn't afraid to dig into a seldom-used area of SQL Server, you should look into writing a custom aggregate function. SQL Server 2005 and 2008 brought CLR integration to the table, including the ability to write user aggregate functions. A custom running total aggregate would be the most efficient way to calculate a running average like this, by far.

Alternatively you can denormalize and store precalculated running values. Described here:

http://sqlblog.com/blogs/alexander_kuznetsov/archive/2009/01/23/denormalizing-to-enforce-business-rules-running-totals.aspx

Performance of selects is as fast as it goes. Of course, modifications are slower.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow