Python: Running Multidimensional Scaling with Incomplete Pairwise Dissimilarity Matrix in HDF5 format

https://stackoverflow.com//questions/25081536

02-01-2020
Question

I am working with large datasets of protein-protein similarities generated with NCBI BLAST. I have stored the results in large pairwise matrices (25,000 x 25,000) and I am using multidimensional scaling (MDS) to visualize the data. These matrices were too large to work with in RAM, so I stored them on disk in HDF5 format and access them with the h5py module.

The sklearn manifold MDS method generated great 3D visualizations for small-scale data, so that is the one I am currently using. For the calculation, it requires a complete symmetric pairwise dissimilarity matrix. However, with large datasets, a sort of "crust" forms that obscures the clusters.

I think the problem is that I am required to input a complete dissimilarity matrix. Some proteins are not related to each other, but in the pairwise dissimilarity matrix, I am forced to input a default max value of dissimilarity. In the documentation of sklearn MDS, it says that a value of 0 is considered a missing value, but inputting 0 where I want missing values does not seem to work.

Is there any way of inputting an incomplete dissimilarity matrix so unrelated proteins don't have to be inputted? Or is there a better/faster way to visualize the data in a pairwise dissimilarity matrix?

Solution

MDS requires a full dissimilarity matrix AFAIK. However, I think it is probably not the best tool for what you plan to achieve. Assuming that your dissimilarity matrix is metric (which need not be the case), it surely can be embedded in 25,000 dimensions, but "crushing" that to 3D will "compress" the data points together too much. That results in the "crust" you'd like to peel away.

I would rather run a hierarchical clustering algorithm on the dissimilarity matrix, then sort the leaves (i.e. the proteins) so that the similar ones are kept together, and then visualize the dissimilarity matrix with rows and columns permuted according to the ordering generated by the clustering. Assuming short distances are colored yellow and long distances are blue (think of the color blind! :-) ), this should result in a matrix with big yellow rectangles along the diagonal where the similar proteins cluster together.
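A minimal sketch of the reordering step, assuming SciPy is available and the matrix fits in memory. The helper names `cluster_order` and `reorder`, the toy data, and the choice of average linkage are all illustrative assumptions, not something prescribed above:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list
from scipy.spatial.distance import squareform


def cluster_order(dissimilarity, method="average"):
    """Return a leaf ordering that places similar items next to each other."""
    # linkage() expects the condensed (upper-triangle) form of a symmetric matrix.
    condensed = squareform(dissimilarity, checks=False)
    tree = linkage(condensed, method=method)
    return leaves_list(tree)


def reorder(dissimilarity, order):
    """Permute rows and columns of the matrix with the same ordering."""
    return dissimilarity[np.ix_(order, order)]


# Toy data (assumption): two tight groups of points, shuffled so the
# block structure is hidden in the original row/column order.
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0, 0.1, (5, 2)), rng.normal(5, 0.1, (5, 2))])
points = points[rng.permutation(len(points))]
toy = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

order = cluster_order(toy)
sorted_toy = reorder(toy, order)  # similar items now sit in blocks on the diagonal
```

The reordered matrix can then be drawn as a heatmap, e.g. with matplotlib's `imshow` and a reversed colormap such as `viridis_r` so that short distances come out yellow. Note that for 25,000 items the condensed matrix alone is roughly 2.5 GB as float64, so this may need a machine with enough RAM or a subsample.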

You would have to downsample the image or buy a 25,000 x 25,000 screen :-) but I assume you want to have an "overall" low-resolution view anyway.
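One simple way to downsample is to average non-overlapping square blocks. This is only a sketch with NumPy; the function name `downsample_block_mean` and the block-mean approach are my assumptions, other reductions (min, median) may suit your data better:

```python
import numpy as np


def downsample_block_mean(matrix, factor):
    """Shrink a 2D array by averaging non-overlapping factor x factor blocks."""
    rows = (matrix.shape[0] // factor) * factor
    cols = (matrix.shape[1] // factor) * factor
    # Drop leftover rows/columns that do not fill a complete block.
    trimmed = matrix[:rows, :cols]
    # Split each axis into (number of blocks, block size) and average per block.
    blocks = trimmed.reshape(rows // factor, factor, cols // factor, factor)
    return blocks.mean(axis=(1, 3))


small = downsample_block_mean(np.arange(16, dtype=float).reshape(4, 4), 2)
```

With `factor=25`, a 25,000 x 25,000 matrix would shrink to 1,000 x 1,000, which fits on an ordinary screen. Since the data lives in HDF5, the same idea could be applied slab by slab (reading, say, 25 rows at a time from the h5py dataset) instead of loading everything at once.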

Other tips

Many algorithms fall under the name nonlinear dimensionality reduction. You can find a long list of them on Wikipedia; most were developed in recent years. If PCA doesn't work well for your data, I would try CCA or t-SNE. The latter is especially good at showing cluster structure.
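For t-SNE, scikit-learn's `TSNE` accepts a precomputed distance matrix via `metric="precomputed"`. A small sketch on made-up data follows; the toy points, perplexity and random seed are assumptions, and note that this still needs a full square matrix, so it does not by itself solve the missing-value problem:

```python
import numpy as np
from sklearn.manifold import TSNE

# Toy data (assumption): two groups of points turned into a distance matrix.
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0, 0.1, (30, 5)), rng.normal(3, 0.1, (30, 5))])
toy = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

# With a precomputed matrix there are no raw features to run PCA on,
# so the initialization has to be random.
tsne = TSNE(
    n_components=2,
    metric="precomputed",
    init="random",
    perplexity=10,  # must be smaller than the number of samples
    random_state=0,
)
embedding = tsne.fit_transform(toy)  # one 2D coordinate per item
```

Plotting `embedding[:, 0]` against `embedding[:, 1]` as a scatter plot should show the groups as separate blobs. For 25,000 items you would likely need to tune the perplexity and expect a noticeably longer run time.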

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow