Algorithm to sum/stack values from a time series graph where data points don't match on time
-
12-10-2019 - |
Question
I have a graphing/analysis problem i can't quite get my head around. I can do a brute force, but its too slow, maybe someone has a better idea, or knows or a speedy library for python?
I have 2+ time series data sets (x,y) that i want to aggregate (and subsequently plot). The issue is that the x values across the series don't match up, and i really don't want to resort to duplicating values into time bins.
So, given these 2 series:
S1: (1;100) (5;100) (10;100)
S2: (4;150) (5;100) (18;150)
When added together, should result in:
ST: (1;100) (4;250) (5;200) (10;200) (18;250)
Logic:
x=1 s1=100, s2=None, sum=100
x=4 s1=100, s2=150, sum=250 (note s1 value from previous value)
x=5 s1=100, s2=100, sum=200
x=10 s1=100, s2=100, sum=200
x=18 s1=100, s2=150, sum=250
My current thinking is to iterate a sorted list of keys(x), keep the previous value for each series, and query each set if it has a new y for the x.
Any ideas would be appreciated!
Solution
Here's another way to do it, putting more of the behaviour on the individual data streams:
class DataStream(object):
def __init__(self, iterable):
self.iterable = iter(iterable)
self.next_item = (None, 0)
self.next_x = None
self.current_y = 0
self.next()
def next(self):
if self.next_item is None:
raise StopIteration()
self.current_y = self.next_item[1]
try:
self.next_item = self.iterable.next()
self.next_x = self.next_item[0]
except StopIteration:
self.next_item = None
self.next_x = None
return self.next_item
def __iter__(self):
return self
class MergedDataStream(object):
def __init__(self, *iterables):
self.streams = [DataStream(i) for i in iterables]
self.outseq = []
def next(self):
xs = [stream.next_x for stream in self.streams if stream.next_x is not None]
if not xs:
raise StopIteration()
next_x = min(xs)
current_y = 0
for stream in self.streams:
if stream.next_x == next_x:
stream.next()
current_y += stream.current_y
self.outseq.append((next_x, current_y))
return self.outseq[-1]
def __iter__(self):
return self
if __name__ == '__main__':
seqs = [
[(1, 100), (5, 100), (10, 100)],
[(4, 150), (5, 100), (18, 150)],
]
sm = MergedDataStream(*seqs)
for x, y in sm:
print "%02s: %s" % (x, y)
print sm.outseq
OTHER TIPS
Something like this:
def join_series(s1, s2):
S1 = iter(s1)
S2 = iter(s2)
value1 = 0
value2 = 0
time1, next1 = next(S1)
time2, next2 = next(S2)
end1 = False
end2 = False
while True:
time = min(time1, time2)
if time == time1:
value1 = next1
try:
time1, next1 = next(S1)
except StopIteration:
end1 = True
time1 = time2
if time == time2:
value2 = next2
try:
time2, next2 = next(S2)
except StopIteration:
end2 = True
time2 = time1
yield time, value1 + value2
if end1 and end2:
raise StopIteration
S1 = ((1, 100), (5, 100), (10, 100))
S2 = ((4, 150), (5, 100), (18, 150))
for result in join_series(S1, S2):
print(result)
It basically keeps the current value of S1 and S2, together with the next of S1 and S2, and steps through them based on which has the lowest "upcoming time". Should handle lists of different lengths to, and uses iterators all the way so it should be able to handle massive dataseries, etc, etc.
One possible approach:
Format all series' element into tuples (x, y, series id), e.g. (4, 150, 1) and add them to a tuple list, and sort it by x ascending.
Declare a list with length equals to number of series to maintain "last seen" value for each series.
Iterate through each element tuple of list in step (1), and:
3.1 Update the "last seen" list according to series id in tuple
3.2 When x of previously iterated tuple doesn't match with x of current tuple, sum all element of "last seen" list and add the result to final list.
Now with my dirty test:
>>>
S1 = ((1, 100), (5, 100), (10, 100))
S2 = ((4, 150), (5, 100), (18, 150))
>>> all = []
>>> for s in S1: all.append((s[0], s[1], 0))
...
>>> for s in S2: all.appned((s[0], s[1], 1))
...
>>> all
[(1, 100, 0), (5, 100, 0), (10, 100, 0), (4, 150, 1), (5, 100, 1), (18, 150, 1)]
>>> all.sort()
>>> all
[(1, 100, 0), (4, 150, 1), (5, 100, 0), (5, 100, 1), (10, 100, 0), (18, 150, 1)]
>>> last_val = [0]*2
>>> last_x = all[0][0]
>>> final = []
>>> for e in all:
... if e[0] != last_x:
... final.append((last_x, sum(last_val)))
... last_val[e[2]] = e[1]
... last_x = e[0]
...
>>> final.append((last_x, sum(last_val)))
>>> final
[(1, 100), (4, 250), (5, 200), (10, 200), (18, 250)]
>>>