I'd like to maintain moving averages for a number of different categories while storing log records. Imagine a service that saves web server logs one entry at a time. Let's further imagine that we don't have access to the logged records afterwards: we see each entry once but can't read it again later.

For different pages, I'd like to know

  • the total number of hits (easy)
  • a "recent" average (like one month or so)
  • a "long term" average (over a year)

Is there any clever algorithm or data model that allows such moving averages to be kept without having to recalculate them by summing up huge quantities of data?

I don't need an exact average (exactly 30 days or so) but just trend indicators. So some fuzziness is not a problem at all. It should just make sure that newer entries are weighted higher than older ones.

One solution would probably be to auto-create statistics records for each month. However, I don't even need past-month statistics, so this seems like overkill. And it wouldn't give me a moving average but would rather switch to new values from month to month.


Solution

An easy solution would be to keep an exponentially decaying total.

It can be calculated using the following formula:

newX = oldX * (p ^ (newT - oldT)) + delta

where oldX is the old value of your total (at time oldT), newX is the new value of your total (at time newT); delta is the contribution of new events to the total (for example the number of hits today); p is less than or equal to 1 and is the decay factor. If we take p = 1, then we have the total number of hits. By decreasing p, we effectively decrease the interval our total describes.
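A minimal Python sketch of this decaying total (the class name, per-day time units, and the specific decay factors are illustrative assumptions, not from the answer):

    class DecayingTotal:
        """Exponentially decaying total: newX = oldX * p**(newT - oldT) + delta."""

        def __init__(self, p):
            self.p = p        # decay factor per unit of time, 0 < p <= 1
            self.value = 0.0  # current decayed total (oldX in the formula)
            self.t = None     # time of the last update (oldT)

        def add(self, delta, now):
            """Apply newX = oldX * p**(newT - oldT) + delta and store the result."""
            if self.t is not None:
                self.value *= self.p ** (now - self.t)
            self.value += delta
            self.t = now
            return self.value

    # Example usage with time measured in days; the p values are rough guesses:
    recent = DecayingTotal(p=0.97)      # decays to ~40% over 30 days
    long_term = DecayingTotal(p=0.997)  # decays slowly, on a ~one-year scale
    for day, hits in enumerate([120, 95, 130, 110]):
        recent.add(hits, now=day)
        long_term.add(hits, now=day)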

Other tips

If all you really want is a smoothed value with a given time constant then the easiest thing is to use a single pole recursive IIR filter (aka AR or auto-regressive filter in time series analysis). This takes the form:

X_new = k * x + (1 - k) * X_old

where X_old is the previous smoothed value, X_new is the new smoothed value, x is the current data point and k is a factor which determines the time constant (usually a small value, < 0.1). You may need to determine the two k values (one value for "recent" and a smaller value for "long term") empirically, based on your sample rate, which ideally should be reasonably constant, e.g. one update per day.
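A minimal Python sketch of such a single-pole filter (the class name, seeding with the first sample, and the specific k values are illustrative assumptions to be tuned empirically):

    class SmoothedValue:
        """Single-pole recursive (IIR) filter: X_new = k * x + (1 - k) * X_old."""

        def __init__(self, k):
            self.k = k         # smoothing factor; smaller k -> longer time constant
            self.value = None  # X_old; None until the first sample arrives

        def update(self, x):
            """Feed one data point (e.g. today's hit count), return the new smoothed value."""
            if self.value is None:
                self.value = x  # seed the filter with the first observation
            else:
                self.value = self.k * x + (1 - self.k) * self.value
            return self.value

    # One update per day, as suggested above; k values are rough guesses:
    recent = SmoothedValue(k=0.03)      # time constant of roughly a month
    long_term = SmoothedValue(k=0.003)  # time constant of roughly a year
    for daily_hits in [120, 95, 130, 110]:
        recent.update(daily_hits)
        long_term.update(daily_hits)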

This may be a solution for you.

You can aggregate data into intermediate storage grouped by hour or day. The grouping queries will then run very fast, because they only need to group a small number of records, and inserts will be fast as well. The precision is up to you.

This can be better than the exponential-decay / auto-regressive approaches above, because it is easier to understand what you are calculating and it doesn't require doing math on every step.

For recent data you can use capped collections with a limited number of records. They are supported natively by some databases, for example MongoDB.
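A rough Python sketch of this bucketed approach (the per-day bucket granularity, the cap of 400 buckets, and the helper names are assumptions for illustration; a capped collection in a database would play the same role as the trimmed dictionary here):

    from collections import OrderedDict

    class BucketedHits:
        """Aggregate hits into per-day buckets and keep only the most recent ones,
        similar in spirit to a capped collection."""

        def __init__(self, max_buckets=400):
            self.max_buckets = max_buckets  # keep roughly a year of daily buckets
            self.buckets = OrderedDict()    # day -> hit count, oldest first
            self.total = 0                  # all-time total, never trimmed

        def record_hit(self, day):
            """Register one hit for the given day (e.g. an integer day number)."""
            self.total += 1
            self.buckets[day] = self.buckets.get(day, 0) + 1
            while len(self.buckets) > self.max_buckets:
                self.buckets.popitem(last=False)  # drop the oldest bucket

        def average(self, last_n):
            """Average daily hits over the newest last_n buckets (the trend value)."""
            recent = list(self.buckets.values())[-last_n:]
            return sum(recent) / len(recent) if recent else 0.0

    # Usage: one bucket per day, 30-day and 365-day trend indicators.
    stats = BucketedHits()
    for day in [1, 1, 2, 3, 3, 3]:
        stats.record_hit(day)
    print(stats.total, stats.average(30), stats.average(365))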
