Question

I have a large csv of similarities between keywords that I would like to convert it to a triangular distance matrix (because it is very large and sparse would be even better) to perform hierarchical clustering using scipy. My current data csv looks like:

a,  b, 1
b,  a, 1
c,  a, 2
a,  c, 2

I am not sure how to do this and I can't find any easy tutorials for clustering in python.

Thanks for any help!

Was it helpful?

Solution

There are two parts to this question:

  • How do you load distances from a CSV of this format into a (maybe sparse) triangular distance matrix?

  • Given a triangular distance matrix, how do you do hierarchical clustering with scipy?

How to load the data: I don't think scipy.cluster.hierarchy works with sparse data, so let's do it dense. I'm also going to do it into the full square matrix and then take the upper triangle that scipy wants, out of laziness; you could index directly into the compressed version if you were being more clever.

from collections import defaultdict
import csv
import functools
import itertools
import numpy as np

# name_to_id associates a name with an integer 0, 1, ...
name_to_id = defaultdict(functools.partial(next, itertools.count()))

with open('file.csv') as f:
    reader = csv.reader(f)

    # do one pass over the file to get all the IDs so we know how 
    # large to make the matrix, then another to fill in the data.
    # this takes more time but uses less memory than loading everything
    # in in one pass, because we don't know how large the matrix is; you
    # can skip this if you do know the number of elements from elsewhere.
    for name_a, name_b, dist in reader:
        idx_a = name_to_id[name_a]
        idx_b = name_to_id[name_b]

    # make the (square) distances matrix
    # this should really be triangular, but the formula for 
    # indexing into that is escaping me at the moment
    n_elem = len(name_to_id)
    dists = np.zeros((n_elem, n_elem))

    # go back to the start of the file and read in the actual data
    f.seek(0)
    for name_a, name_b, dist in reader:
        idx_a = name_to_id[name_a]
        idx_b = name_to_id[name_b]
        dists[(idx_a, idx_b) if idx_a < idx_b else (idx_b, idx_a)] = dist

condensed = dists[np.triu_indices(n_elem, 1)]

Then call e.g. scipy.cluster.hierarchy.linkage with condensed. To map from the indices back to names, you can use something like

id_to_name = dict((id, name) for name, id in name_to_id.iteritems())
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top