Question

Suppose there are 14 objects, each of which either has or lacks each of 1000 binary features. I have the 14x14 similarity matrix, but not the raw 14x1000 data. Is there a way to reconstruct the raw data, or to generate something similar to it, given only the similarity matrix?

I tried Monte Carlo simulations, but without constraints they take far too long to reach even a low level of consistency with the original similarity matrix.

I saw this related question: "Similarity matrix -> feature vectors algorithm?". However, the asker there wanted to reduce dimensionality, not increase it. Also, I am not sure (1) which matrix or matrices to use, and (2) how to convert the result into a binary matrix.

Solution

It's impossible to say for sure unless you describe how the similarity scores were computed.

In general, for the usual kinds of similarity scoring, this is not possible: information is lost in the transformation from individual features to aggregate statistics. The best you can hope for is to arrive at a set of features that is consistent with the similarity scores.

I think that is what you mean by "similar to" the original, and that problem is pretty interesting. Suppose similarity was computed as the dot product of two feature vectors (i.e., the number of features for which both objects have value 1/true). This is not the only possible choice: it amounts to treating a value of 0 (false) as carrying no information. But the approach may generalize to other similarity measures.
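To make the dot-product definition concrete, here is a small Python sketch. The toy data and the name `dot_similarity` are made up for illustration and are not from the question. It also shows why exact reconstruction is out of reach: reordering the feature columns changes the raw data but leaves the similarity matrix untouched.

```python
def dot_similarity(features):
    """Pairwise dot products between rows of a 0/1 feature matrix."""
    return [[sum(a * b for a, b in zip(row_i, row_j)) for row_j in features]
            for row_i in features]

# Hypothetical toy data: 3 objects, 5 binary features each.
features = [
    [1, 1, 0, 0, 1],
    [1, 0, 0, 1, 1],
    [0, 1, 0, 0, 0],
]
sim = dot_similarity(features)

# Reverse the feature order for every object: different raw data ...
shuffled = [row[::-1] for row in features]
# ... but exactly the same similarity matrix.
same = dot_similarity(shuffled) == sim
```

Here the diagonal entry `sim[i][i]` is simply the number of features object `i` has, and each off-diagonal entry counts the features two objects share.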

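In such a case, the problem is really a constraint-satisfaction problem over 0/1 variables (it can be written as an integer program rather than an ordinary linear program, since the unknowns must be binary). A naive approach is to search the space of possible objects exhaustively: not randomly, but guided by the constraints. For example, let SIM(A,B) be the similarity of object A and object B, and define an order on the feature vectors.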

If SIM(A,B) = N, then choose A = B minimal, e.g. (1, ..., 1, 0, ..., 0) with N ones followed by 1000-N zeros, and then choose the minimum C such that SIM(A,C) and SIM(B,C) have the given values. Whenever you find an inconsistency, backtrack and increment.

This will find a consistent answer if one exists, although the complexity is very high (but probably better than Monte Carlo).
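The search described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not a tuned implementation: similarity is the dot product, the diagonal `sim[k][k]` (the number of features object `k` has) is assumed to be known, and the name `reconstruct` and the toy matrix are hypothetical. Candidates for each object are tried in lexicographic order of the positions of their ones, and the search backtracks as soon as an object cannot be placed.

```python
from itertools import combinations

def reconstruct(sim, n_features):
    """Backtracking search for 0/1 rows whose pairwise dot products equal sim.

    Returns a list of rows, or None if no consistent assignment exists.
    """
    n = len(sim)
    chosen = []  # chosen[j] = set of feature indices that are 1 for object j

    def extend(k):
        if k == n:
            return True
        # Try candidates for object k in lexicographic order.
        for ones in combinations(range(n_features), sim[k][k]):
            candidate = set(ones)
            # Keep the candidate only if it matches every earlier object.
            if all(len(candidate & chosen[j]) == sim[j][k] for j in range(k)):
                chosen.append(candidate)
                if extend(k + 1):
                    return True
                chosen.pop()  # dead end further down: backtrack
        return False

    if not extend(0):
        return None
    return [[1 if f in row else 0 for f in range(n_features)] for row in chosen]

# Hypothetical toy target: 3 objects, 5 features.
target = [[3, 2, 1], [2, 3, 0], [1, 0, 1]]
rows = reconstruct(target, 5)
rebuilt = [[sum(a * b for a, b in zip(r, s)) for s in rows] for r in rows]
impossible = reconstruct([[1, 2], [2, 1]], 3)  # overlap larger than either row
```

On the toy input, `rebuilt` equals `target`, so the rows are consistent with the similarity matrix even though they need not match whatever data originally produced it. The worst-case cost grows combinatorially, so at 14 objects and 1000 features this naive version would need much stronger pruning to be practical.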

Finding a better algorithm is an interesting problem, but more than this I can't say in a SO post - that's probably a topic for a CS thesis!

License: CC-BY-SA with attribution
Not affiliated with StackOverflow