Is there a well established incremental algorithm to maintain a history of values with accumulation over specific time frames?

StackOverflow https://stackoverflow.com/questions/13943898

Question

I have practically completed one, but wanted to compare mine with a well researched an possibly academic algorithm. There may be a library of statistical objects which either directly or in-combination solve my particular need.

My system (which I intend to OpenSource) has a stream of NetFlow data. Rather than store in database and using SQL functions, I prefer to have a database-free system and maintain a set of statistics, updated for each new flow, and scrolled per-second (or higher).

My solution involves an single array of uint, to effectively create a jagged array of sizes [60, 59, 23, 6, ...], representing seconds, minutes, hours, days, weeks, etc....

Each slot contains the total amount of Bytes for that time. So after 60 seconds a single minute statistic is created as Avg(seconds). This of course continues relatively up the time scale.

Rather than simply having thousands of second increments, it is due to:

  1. Memory constraints and the potential to have more statistical nodes; AND
  2. Ideal presentation to users

...that I roll up time scales.

Given that a flow may be applied to several nodes in a heirarchy of statistics (WAN Link, IP Address, Destination Address, SourcePort-DestinationPort), I calculate the delta once (GenerateDelta) and then simply apply at every node which is both active and which matches the flow meta-data.

A statistic on a given node would be "scrolled", in the following potential cases:

  1. When being read/displayed (via HTTP\JSON AJAX Request)
  2. When a delta is being applied (due to relevant flow)
  3. Simply every n-seconds (n is typically 1)

Overall there may be a well established algorithm for keeping running totals over time (with seconds, minutes...). But failing that, there may also be a suitable algorithms for comparison on smaller sub-sections of my code:

  • GenerateDelta - not likely as this is specific for breaking down and averaging a flow with duration over slots in the statistics array.
  • Scroll - if there were only seconds, then this would of course be simple, however my solution requires the 60 seconds to be combined into a new minute total every 60 seconds, and so on.

I do not wish responders to suggest any of their own algorithms, I have already (almost) completed all of my own without any problems and with many performance considerations. And others will likely be able to have a look at my algorithm when I have finished and published as Open Source.

What I do wish to see is any "well established" algorithms for comparison. Perhaps mine will be better, perhaps mine will be worse. Google isn't good at this sort of question, I need your help.

Thanks!

Was it helpful?

Solution

Thanks to comment from @rici, I found the "Stream Statistics" domain is what is required. There are Data Stream Management Systems (DSMS) for dealing with stream statistics. Whereas SQL RDBMS systems can store data with statistics generated by SQL query, a Data Stream Management Systems, enables the processing of a continuous stream of data, given one or more queries.

This paper, describes a DSMS as:

  1. Being able to sacrifice quality for qualitative use
  2. Being single pass, because the data is vast
  3. Having Queries treating data as sequences not sets And more...

This one, depicts a diagram of such a DSMS, references the Network Traffic Analysis problem domain,

This paper, describes StreamSQL, SQL-like syntax, for defining continuous queries.

Even though proprietary solutions are not accessible. There certainly are well established algorithms. I can therefore test the performance of my specialised system against general stream query tools.

Several products/prototypes of DSMS can be found in this wiki page, specifically Odysseus is of interest, being Java based and open source.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top