Question

I am trying to implement something along the lines of a Moving Average.

In this system, there are no guarantees of a quantity of Integers per time period. I do need to calculate the Average for each period. Therefore, I cannot simply slide over the list of integers by quantity as this would not be relative to time.

I can keep a record of each value with its associated time. We will have a ton of data running through the system so it is important to 'garbage collect' the old data.

It may also be important to note that I need to save the average to disk after the end of each period. However, they may be some overlap between saving the data to disk and having data from a new period being introduced.

What are some efficient data structures I can use to store, slide, and garbage collect this type of data?

Was it helpful?

Solution

The description of the problem and the question conflict: what is described is not a moving average, since the average for each time period is distinct. ("I need to compute the average for each period.") So that admits a truly trivial solution:

For each period, maintain a count and a sum of observations.

At the end of the period, compute the average

I suspect that what is actually wanted is something like: Every second (computation period), I want to know the average observation over the past minute (aggregation period).

This can be solved simply with a circular buffer of buckets, each of which represents the value for one computation period. There will be aggregation period / computation period such buckets. Again, each bucket contains a count and a sum. Also, a current total/sum and a cumulative total sum/count are maintained. Each observation is added to the current total/sum.

At the end of a each computation period:

  • subtract the sum/count for the (circularly) first period from the cumulative sum/count
  • add the current sum/count to the cumulative sum/count
  • report the average based on the cumulative sum/count
  • replace the values of the first period with the current sum/count
  • clear the current sum/count
  • advance the origin of the circular buffer.

If you really need to be able to compute at any time at all the average of the previous observations over some given period, you'd need a more complicated data structure, basically an expandable circular buffer. However, such precise computations are rarely actually necessary, and a bucketed approximation, as per the above algorithm, is usually adequate for data purposes, and is much more sustainable over the long term for memory management, since its memory requirements are fixed from the start.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top