How to cluster items in a list without specifying the number of clusters

https://stackoverflow.com/questions/22435001

15-06-2023
|

Question

My goal is to cluster the items in the list below based on distances between any two consecutive items. The number of clusters is not specified, only the maximum distance between any two consecutive items that must not be exceeded for them to be in the same cluster.

My attempt

import itertools
max_dist=20
_list=[1,48,52,59,89,94,103,147,151,165]
Ideal_result= [[1],[48,52,59],[89,94,103],[147,151,165]]

def clust(list_x, max_dist):
    q=[]
    for a, b in itertools.combinations(list_x,2):
        if b-a<=20:
            q.append(a),(b)
        else:continue
        yield q
print list(clust(_list,max_dist))

Output:

[[48,48,52,89,89,94,147,147,151],[48,48,52,89,89,94,147,147,151],..]`

The output is completely wrong, but I just wanted to include my attempt.

Any suggestions on how to get the ideal result? Thanks.

Solution

This passes your test:

def cluster(items, key_func):
    items = sorted(items)
    clusters = [[items[0]]]
    for item in items[1:]:
        cluster = clusters[-1]
        last_item = cluster[-1]
        if key_func(item, last_item):
            cluster.append(item)
        else:
            clusters.append([item])
    return clusters

Where key_func returns True if the current and previous items should be part of the same cluster:

>>> cluster([1,48,52,59,89,94,103,147,151,165], lambda curr, prev: curr-prev < 20)
[[1], [48, 52, 59], [89, 94, 103], [147, 151, 165]]

Another possibility would be to modify the "equivalent code" for itertools.groupby() to likewise take more than one argument for the key function.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow