Question

I have a graphing/analysis problem i can't quite get my head around. I can do a brute force, but its too slow, maybe someone has a better idea, or knows or a speedy library for python?

I have 2+ time series data sets (x,y) that i want to aggregate (and subsequently plot). The issue is that the x values across the series don't match up, and i really don't want to resort to duplicating values into time bins.

So, given these 2 series:

S1: (1;100) (5;100) (10;100)
S2: (4;150) (5;100) (18;150)

When added together, should result in:

ST: (1;100) (4;250) (5;200) (10;200) (18;250)

Logic:

x=1 s1=100, s2=None, sum=100
x=4 s1=100, s2=150, sum=250 (note s1 value from previous value)
x=5 s1=100, s2=100, sum=200
x=10 s1=100, s2=100, sum=200
x=18 s1=100, s2=150, sum=250

My current thinking is to iterate a sorted list of keys(x), keep the previous value for each series, and query each set if it has a new y for the x.

Any ideas would be appreciated!

Was it helpful?

Solution

Here's another way to do it, putting more of the behaviour on the individual data streams:

class DataStream(object):
    def __init__(self, iterable):
        self.iterable = iter(iterable)
        self.next_item = (None, 0)
        self.next_x = None
        self.current_y = 0
        self.next()

    def next(self):
        if self.next_item is None:
            raise StopIteration()
        self.current_y = self.next_item[1]
        try:
            self.next_item = self.iterable.next()
            self.next_x = self.next_item[0]
        except StopIteration:
            self.next_item = None
            self.next_x = None
        return self.next_item

    def __iter__(self):
        return self


class MergedDataStream(object):
    def __init__(self, *iterables):
        self.streams = [DataStream(i) for i in iterables]
        self.outseq = []

    def next(self):
        xs = [stream.next_x for stream in self.streams if stream.next_x is not None]
        if not xs:
            raise StopIteration()
        next_x = min(xs)
        current_y = 0
        for stream in self.streams:
            if stream.next_x == next_x:
                stream.next()
            current_y += stream.current_y
        self.outseq.append((next_x, current_y))
        return self.outseq[-1]

    def __iter__(self):
        return self


if __name__ == '__main__':
    seqs = [
        [(1, 100), (5, 100), (10, 100)],
        [(4, 150), (5, 100), (18, 150)],
        ]

    sm = MergedDataStream(*seqs)
    for x, y in sm:
        print "%02s: %s" % (x, y)

    print sm.outseq

OTHER TIPS

Something like this:

def join_series(s1, s2):
    S1 = iter(s1)
    S2 = iter(s2)
    value1 = 0
    value2 = 0
    time1, next1 = next(S1)
    time2, next2 = next(S2)
    end1 = False
    end2 = False

    while True:    
        time = min(time1, time2)
        if time == time1:
            value1 = next1
            try:
                time1, next1 = next(S1)
            except StopIteration:
                end1 = True
                time1 = time2

        if time == time2:
            value2 = next2
            try:
                time2, next2 = next(S2)
            except StopIteration:
                end2 = True
                time2 = time1

        yield time, value1 + value2

        if end1 and end2:
            raise StopIteration

S1 = ((1, 100), (5, 100), (10, 100))
S2 = ((4, 150), (5, 100), (18, 150))

for result in join_series(S1, S2):
    print(result)

It basically keeps the current value of S1 and S2, together with the next of S1 and S2, and steps through them based on which has the lowest "upcoming time". Should handle lists of different lengths to, and uses iterators all the way so it should be able to handle massive dataseries, etc, etc.

One possible approach:

  1. Format all series' element into tuples (x, y, series id), e.g. (4, 150, 1) and add them to a tuple list, and sort it by x ascending.

  2. Declare a list with length equals to number of series to maintain "last seen" value for each series.

  3. Iterate through each element tuple of list in step (1), and:

    3.1 Update the "last seen" list according to series id in tuple

    3.2 When x of previously iterated tuple doesn't match with x of current tuple, sum all element of "last seen" list and add the result to final list.

Now with my dirty test:

>>> 
S1 = ((1, 100), (5, 100), (10, 100))
S2 = ((4, 150), (5, 100), (18, 150))
>>> all = []
>>> for s in S1: all.append((s[0], s[1], 0))
...
>>> for s in S2: all.appned((s[0], s[1], 1))
...
>>> all
[(1, 100, 0), (5, 100, 0), (10, 100, 0), (4, 150, 1), (5, 100, 1), (18, 150, 1)]
>>> all.sort()
>>> all
[(1, 100, 0), (4, 150, 1), (5, 100, 0), (5, 100, 1), (10, 100, 0), (18, 150, 1)]
>>> last_val = [0]*2
>>> last_x = all[0][0]
>>> final = []
>>> for e in all:
...     if e[0] != last_x:
...             final.append((last_x, sum(last_val)))
...     last_val[e[2]] = e[1]
...     last_x = e[0]
...
>>> final.append((last_x, sum(last_val)))
>>> final
[(1, 100), (4, 250), (5, 200), (10, 200), (18, 250)]
>>>
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top