Question

I'm trying to perform hierarchical clustering using a custom distance measure. I do all the calculations in Python and then pass the data structures to R to do the clustering:

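import rpy2.robjects as robjects
from rpy2.robjects.packages import importr

r = robjects.r
stats = importr('stats')

# list_of_data holds the size x size distances, row by row
m = r.matrix(robjects.FloatVector(list_of_data), ncol=size, byrow=True)
# convert the full matrix into the dist object that hclust expects
dist_mat = stats.as_dist(m)
hc = stats.hclust(dist_mat)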

So my distance measures are held in a Python list, converted to an R matrix, which is then converted into the dist object required for the clustering. This works to an extent. However, when the matrix becomes too big, I get this error:

python(18944,0xb0081000) malloc: *** mmap(size=168898560) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
Error: cannot allocate vector of size 161.1 Mb

This occurs at the point where I convert to a dist object (as.dist). I haven't tested at what size it falls apart: it works with a 3000x3000 matrix but fails with a 6500x6500 matrix, so the limit is somewhere in between. I'm using the del statement in Python to try to remove any unnecessary objects from memory, but from what I've read this doesn't guarantee that the memory will become immediately available for use.

So, ultimately, is there a more memory-efficient way to get a dist object? Or is there perhaps an alternative method I could use? I've found some other methods in R's cluster library that do not use a dist object, but these methods use built-in distance metrics.

Thanks in advance!

Solution

Using Python's del does not guarantee that the memory becomes immediately available for use. Calling the garbage collector explicitly helps. The answer to another question here (Clearing memory used by rpy2) points to the relevant section in the rpy2 documentation.
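
As a minimal sketch of what "calling the garbage collector explicitly" could look like: the helper below runs Python's collector and then, optionally, R's. The function name release_memory and its r_gc parameter are my own illustration, not part of rpy2; r_gc is assumed to be any zero-argument callable that triggers R's collector.

```python
import gc


def release_memory(r_gc=None):
    """Run Python's garbage collector, then optionally R's.

    r_gc is a hypothetical parameter: any zero-argument callable that
    triggers R's garbage collector (with rpy2, robjects.r['gc'] is one
    candidate).
    """
    # Collect unreachable Python objects, including wrappers around R objects.
    unreachable = gc.collect()
    if r_gc is not None:
        # Give R the chance to reclaim what Python has just released.
        r_gc()
    return unreachable
```

With rpy2 loaded, one would drop the references first (for example del m once the dist object exists) and then call release_memory(robjects.r['gc']). Whether this frees enough memory for a 6500x6500 case depends on what else is still referenced.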

Regarding clustering algorithms: hierarchical clustering with hclust() does require a "distance" matrix. R saves a bit of memory because the matrix is symmetric, so the dist object stores only the lower triangle without the diagonal, i.e. n * (n - 1) / 2 values. Other clustering algorithms exist, and if you are keen on hierarchical clustering there are tricks to reduce the size of the starting matrix by creating initial blocks, but that is outside the scope of a programming-related question.
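
To put numbers on the sizes involved, here is a rough sketch in plain Python. It assumes 8-byte doubles and that an R dist object holds the n * (n - 1) / 2 values of the strict lower triangle, column by column; the helper names are made up for illustration.

```python
def dist_length(n):
    """Number of values stored for n observations (strict lower triangle)."""
    return n * (n - 1) // 2


def mib(n_values, bytes_per_value=8):
    """Size in MiB of n_values doubles (assumed 8 bytes each)."""
    return n_values * bytes_per_value / 2**20


def lower_triangle(flat, n):
    """Strict lower triangle of a row-major n x n list, taken column by column."""
    return [flat[i * n + j] for j in range(n) for i in range(j + 1, n)]


if __name__ == "__main__":
    for n in (3000, 6500):
        full = round(mib(n * n), 1)
        condensed = round(mib(dist_length(n)), 1)
        print(n, "full matrix:", full, "MiB; dist values:", condensed, "MiB")
```

For n = 6500 the condensed vector comes to about 161.1 MiB, which matches the size in the error message, so the failing allocation is plausibly the dist vector itself, requested while the roughly 322 MiB full matrix is still in memory. Passing only the lower_triangle() values to R would avoid building the full matrix, but turning such a vector into a dist object is something I have not shown here and would need checking against the R documentation for dist.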

Licensed under: CC-BY-SA with attribution