Question

Suppose there are 14 objects, each of which have or do not have 1000 binary features. I have a 14x14 similarity matrix, but not the raw 14x1000 data. Is there a way to reconstruct or generate something similar to the raw data, given the similarity matrix?

I tried Monte Carlo simulations, but unconstrained they would take way too much time to achieve even a low level of consistency with the original similarity matrix.

I saw this relevant question: Similarity matrix -> feature vectors algorithm?. However, they wanted to reduce not increase dimensionality. Also, I am not sure (1) which matrix or matrices to use, and (2) how to convert into a binary matrix.

Était-ce utile?

La solution

It's impossible to say for sure unless you describe how the similarity scores were computed.

In general, for the usual kind of similarity scoring this is not possible: information has been lost in the transformation from individual features to aggregate statistics. The best you can hope to do is to arrive at a set of features that are consistent with the similarity scores.

I think that is what you are talking about when you say "similar to" the original. That problem is pretty interesting. Suppose similarity was computed as the dot-product of two feature vectors (ie the count of features for a pair of objects that both have value = 1/true). This is not the only choice: it is consistent with value of 0 (false) meaning no information. But it may generalize to other similarity measures.

In such a case, the problem is really a linear programming problem: a naive approach is to exhaustively search the space of possible objects - not randomly, but guided by the constraints. For example, suppose SIM(A,B) := similarity of object A and object B. Define an order on these vectors.

If SIM(A,B) = N, then choose A=B minimal (like (1,....,1 (N times), 0, .... 0 (1000-N times)), and then choose the minimum C s.t. (A,C), (B,C) have the given values. Once you find an inconsistency, backtrack, and increment.

This will find a consistent answer, although the complexity is very high (but probably better than monte carlo).

Finding a better algorithm is an interesting problem, but more than this I can't say in a SO post - that's probably a topic for a CS thesis!

Licencié sous: CC-BY-SA avec attribution
Non affilié à StackOverflow
scroll top