Grouping data points into series

https://stackoverflow.com/questions/1549412

20-09-2019
|

Question

I have a series of data points (tuples) in a list with a format like:

points = [(1, 'a'), (2, 'b'), (2, 'a'), (3, 'd'), (4, 'c')]

The first item in each tuple is an integer and they are assured to be sorted. The second value in each tuple is an arbitrary string.

I need them grouped in lists by their first value in a series. So given an interval of 3, the above list would be broken into:

[['a', 'b', 'a', 'd'], ['c']]

I wrote the following function, which works fine on small data sets. However, it is inneficient for large inputs. Any tips on how to rewrite/optimize/mininize this so I can process large data sets?

def split_series(points, interval):
    series = []

    start = points[0][0]
    finish = points[-1][0]

    marker = start
    next = start + interval
    while marker <= finish:
        series.append([point[1] for point in points if marker <= point[0] < next])
        marker = next
        next += interval

    return series

Solution

For completeness, here's a solution with itertools.groupby, but the dictionary solution will probably be faster (not to mention a lot easier to read).

import itertools
import operator

def split_series(points, interval):
    start = points[0][0]

    return [[v for k, v in grouper] for group, grouper in
            itertools.groupby((((n - start) // interval, val)
                               for n, val in points), operator.itemgetter(0))]

Note that the above assumes you've got at least one item in each group, otherwise it'll give different results from your script, i.e.:

>>> split_series([(1, 'a'), (2, 'b'), (6, 'a'), (6, 'd'), (11, 'c')], 3)
[['a', 'b'], ['a', 'd'], ['c']]

instead of

[['a', 'b'], ['a', 'd'], [], ['c']]

Here's a fixed-up dictionary solution. At some point the dictionary lookup time will begin to dominate, but maybe it's fast enough for you like this.

from collections import defaultdict

def split_series(points, interval):
    offset = points[0][0]
    maxval = (points[-1][0] - offset) // interval
    vals = defaultdict(list)
    for key, value in points:
        vals[(key - offset) // interval].append(value)
    return [vals[i] for i in xrange(maxval + 1)]

OTHER TIPS

Your code is O(n²). Here's an O(n) solution:

def split_series(points, interval):
    series = []
    current_group = []
    marker = points[0][0]
    for value, data in points:
        if value >= marker + interval:
            series.append(current_group)
            current_group = []
            marker += interval
        current_group.append(data)

    if current_group:
        series.append(current_group)

    return series

points = [(1, 'a'), (2, 'b'), (2, 'a'), (3, 'd'), (4, 'c')]
print split_series(points, 3)  # Prints [['a', 'b', 'a', 'd'], ['c']]

One way to do it (no promises on speed):

Break your list of tuples into two lists: [1,2,2,3,4] and ['a','b','a','d','c']

Since the first list is sorted, you can just keep iterating over it until you get to an element out of the range. Then, you know the indexes of the start and end elements so you can just slice the strings out of second array. Continue until you've got all the intervals.

I'm not sure how efficient that'll be with tradition Python lists, but if your dataset is large enough, you could try using a NumPy array, which will slice really quickly.

From your code, I'm assuming my prior comment is correct. The problem here appears to be that the performance is O(n^2) - you repeat the list comprehension (which iterates all items) multiple times.

I say, use a simple for loop. If the current item belongs in the same group as the previous one, add it to the existing inner list [["a"], ["b"]] -> [["a"], ["b", "c"]]. If it doesn't, add it to a new inner list, perhaps adding empty padding lists first.

Expanding on Am's answer, use a defaultdict, and floor-divide the key by the interval to break them up correctly.

from collections import defaultdict
def split_series(points, interval):
    vals = defaultdict(list)
    for key, value in points:
        vals[(key-1)//interval].append(value)
    return vals.values()

Here's a lazy approach that uses the step behavior of xrange:

def split_series(points, interval):
    end_of_chunk = interval
    chunk = []
    for marker, item in points:
        if marker > end_of_chunk:
            for end_of_chunk in xrange(end_of_chunk, marker, interval):
                yield chunk
                chunk = []
            end_of_chunk += interval
        chunk.append(item)
    yield chunk

How about using iterators for lazy evaluation?

This should be the equivalent of your initial solution:

from itertools import groupby

def split_series(points, interval):
    """
    >>> points = [(1, 'a'), (2, 'b'), (2, 'a'), (3, 'd'), (4, 'c')]
    >>> print list(split_series(points, 3))
    [['a', 'b', 'a', 'd'], ['c']]
    """

    def interval_key(t):
        return (t[0] - points[0][0]) // interval

    groups = groupby(points, interval_key)

    for group in groups:
        yield [v for _, v in group[1]]

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow