Python alternate way to find dendrogram

https://stackoverflow.com/questions/12567813

03-07-2021

Question

I have data of dimension 8000x100 and need to cluster the 8000 items. I am mostly interested in the ordering of these items. The code below gives the desired result for small data, but for larger inputs I keep getting the runtime error "RuntimeError: maximum recursion depth exceeded while getting the str of an object". Is there an alternative way to get the reordered columns from "Z"?

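from hcluster import pdist, linkage, dendrogram
import numpy
from numpy.random import rand

x = rand(8, 100)  # rand(8000, 100) gives the runtime error
Y = pdist(x)      # condensed pairwise distances between rows
Z = linkage(Y)    # linkage matrix (single linkage by default)
reorderedCol = dendrogram(Z)['ivl']  # leaf labels in dendrogram order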


Traceback: 

>>> from hcluster import pdist, linkage, dendrogram
>>> import numpy
>>> from numpy.random import rand
>>> 

>>> x = rand(8000,100)
>>> Y = pdist(x)
>>> Z = linkage(Y)
>>> reorderedCol = dendrogram(Z)['ivl']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/hcluster/hierarchy.py", line 2062, in dendrogram
    link_color_func=link_color_func)
  File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/hcluster/hierarchy.py", line 2342, in _dendrogram_calculate_info
    link_color_func=link_color_func)
  File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/hcluster/hierarchy.py", line 2342, in _dendrogram_calculate_info
    link_color_func=link_color_func)
  File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/hcluster/hierarchy.py", line 2342, in _dendrogram_calculate_info

...
...

  File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/hcluster/hierarchy.py", line 2311, in _dendrogram_calculate_info
    link_color_func=link_color_func)
  File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/hcluster/hierarchy.py", line 2209, in _dendrogram_calculate_info
    _append_singleton_leaf_node(Z, p, n, level, lvs, ivl, leaf_label_func, i, labels)
  File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/hcluster/hierarchy.py", line 2091, in _append_singleton_leaf_node
    ivl.append(str(int(i)))
RuntimeError: maximum recursion depth exceeded while getting the str of an object
>>>

Solution

The problem is that a dendrogram is a visualization technique. At 8000 objects it is already pretty much unreadable, which is probably why the implementation was never optimized for inputs of this size.

For larger data sets, I recommend moving away from hierarchical clustering altogether (implemented naively with matrix operations it has O(n^3) runtime, although some cases, such as single linkage, can be done in O(n^2)) and using e.g. OPTICS (see Wikipedia) instead. Do not use the OPTICS in Weka, or that Python version that is floating around; afaict they are both incomplete!

I cannot even run dendrogram; I get the error "matplotlib not available. Plot request denied." So it probably does actually try to visualize the dendrogram, which may well run out of memory if it puts a lot of effort into optimizing the visualization. By doing it yourself, as I showed you in your other question "Calculate ordering of dendrogram leaves", you should be able to avoid this extra cost.
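The traceback in the question re-enters _dendrogram_calculate_info over and over, which suggests the tree is walked recursively. Below is a minimal sketch of a "do it yourself" traversal that uses an explicit stack instead, so it does not depend on Python's recursion limit. The name leaf_order is made up for this illustration, and it assumes Z follows the usual hcluster/SciPy layout in which row i merges clusters Z[i][0] and Z[i][1] into a new cluster with id n + i. It is not necessarily the same code as in the linked answer.

```python
def leaf_order(Z):
    """Return leaf indices in left-to-right tree order, without recursion.

    Z is a linkage matrix (any sequence of rows): row i merges the clusters
    Z[i][0] and Z[i][1] into cluster n + i, where n = len(Z) + 1 leaves.
    """
    n = len(Z) + 1
    order = []
    stack = [2 * n - 2]  # id of the root cluster (the last merge)
    while stack:
        node = stack.pop()
        if node < n:
            # ids below n are original observations, i.e. leaves
            order.append(node)
        else:
            left, right = int(Z[node - n][0]), int(Z[node - n][1])
            # push right first so that the left child is visited first
            stack.append(right)
            stack.append(left)
    return order


# Tiny hand-built example: {2,3} and {0,1} are merged first, then joined.
Z_small = [[2, 3, 0.4, 2],
           [0, 1, 0.5, 2],
           [4, 5, 1.0, 4]]
print(leaf_order(Z_small))  # [2, 3, 0, 1]
```

With dendrogram's default settings (no count or distance sorting) this left-before-right order should correspond to the 'ivl' labels, but that is worth checking on a small input before relying on it.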

Is there a reason you are using hcluster instead of scipy.cluster.hierarchy?
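If switching to scipy.cluster.hierarchy is an option, its leaves_list function returns the leaf order directly from Z without going through dendrogram at all. The sketch below assumes NumPy and SciPy are installed and uses a 200-row stand-in for the 8000x100 data so that it runs quickly; the seed and sizes are arbitrary.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list
from scipy.spatial.distance import pdist

rng = np.random.RandomState(0)
x = rng.rand(200, 100)       # stand-in; the question uses rand(8000, 100)

Y = pdist(x)                 # condensed pairwise distances
Z = linkage(Y)               # single linkage by default
reordered = leaves_list(Z)   # leaf indices, left to right in the tree

print(reordered[:10])
```

Here reordered is an integer array of row indices into x, rather than the list of string labels that dendrogram(Z)['ivl'] returns.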

OTHER TIPS

but for higher dimension, I keep getting runtime error "RuntimeError: maximum recursion depth exceeded while getting the str of an object"

The memory issue could be helped by using some form of dimensionality reduction technique such as PCA or t-SNE.

Reduce from 100 dimensions to 20 or so.

Running t-SNE takes time, so you could first reduce from 100 dimensions to, say, 50 with PCA (which is faster) and then use t-SNE to go down to 10 or so.

Beware: these will lead to some loss of information, but might just get the job done.
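As a rough sketch of the PCA step, the rows can be projected onto their first k principal components with a plain SVD. The helper name pca_reduce is made up for this illustration, only NumPy is assumed, and 500 rows stand in for the real 8000. Note that this only shrinks the number of columns: the number of items, and therefore the number of leaves in the linkage tree, stays the same.

```python
import numpy as np


def pca_reduce(x, k):
    """Project the rows of x onto their first k principal components."""
    centered = x - x.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered.dot(vt[:k].T)


x = np.random.RandomState(0).rand(500, 100)  # stand-in for the 8000x100 data
x20 = pca_reduce(x, 20)
print(x20.shape)  # (500, 20)
```

The reduced array can then be passed to pdist and linkage in place of x.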

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow