Question

How can documents be arranged in a space (say, a set of grids) so that the position of each document carries information about how similar it is to other documents? I looked into k-means clustering, but it is computationally intensive when the data set is large. I'm looking for something like hashing the contents of each document, so that all documents fit into one large space, similar documents get similar hashes, and the distance between them is small. It would then be easy to find documents similar to a given document without much extra work.

The result could be something similar to the picture below. In this case music documents are near film documents but far from documents related to computers. The box can be considered as the whole world of documents.

[Image: documents placed in a box, grouped by topic]

Any help would be greatly appreciated.

Thanks

jvc007


Solution

One way to introduce a distance or similarity measure between documents is:

  • first encode your documents as vectors, e.g. using TF-IDF (see https://en.wikipedia.org/wiki/Tf%E2%80%93idf)

  • the scalar product of the vectors for two documents then gives a measure of their similarity: the larger the value, the more similar the documents. If the vectors are normalized to unit length, this is the cosine similarity.
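The two steps above can be sketched in plain Python. This is only a minimal illustration: the whitespace tokenizer, the smoothed IDF formula, and the three sample sentences are assumptions made for the example, not part of the original answer.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one unit-length TF-IDF vector (dict: term -> weight) per document."""
    tokenised = [doc.lower().split() for doc in docs]  # naive tokenizer (assumption)
    n_docs = len(tokenised)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        # Smoothed IDF, one common variant among several (assumption).
        vec = {
            term: (count / len(tokens)) * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({term: w / norm for term, w in vec.items()})
    return vectors


def dot(u, v):
    """Scalar product of two sparse vectors (cosine similarity for unit vectors)."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(term, 0.0) for term, w in u.items())


# Made-up sample documents.
docs = [
    "guitar solo and drum rhythm in a rock song",
    "the film score uses guitar and drum themes",
    "the compiler translates source code into machine code",
]
vecs = tfidf_vectors(docs)
sim_music_film = dot(vecs[0], vecs[1])
sim_music_computers = dot(vecs[0], vecs[2])
```

Here the music and film sentences share a few terms, so their scalar product is positive, while the music and computer sentences share none and get a similarity of zero. On real data you would probably want a proper tokenizer and stop-word handling.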

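Applying MDS (multidimensional scaling, http://en.wikipedia.org/wiki/Multidimensional_scaling) to these similarities, after converting them to dissimilarities (for example, 1 minus the similarity), should help to visualize the documents in a two-dimensional plot.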
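To give an idea of what MDS does, here is a rough sketch that places points in 2D by plain gradient descent on the squared difference between embedded and target distances. It is not a reference implementation; the function name `mds_2d`, the step size, the iteration count, and the dissimilarity values are all assumptions chosen for the example.

```python
import math
import random


def mds_2d(dist, n_iter=500, lr=0.05, seed=0):
    """Place len(dist) points in 2D so that pairwise distances approximate `dist`."""
    rng = random.Random(seed)
    n = len(dist)
    pts = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(n)]
    for _ in range(n_iter):
        grads = [[0.0, 0.0] for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                dx = pts[i][0] - pts[j][0]
                dy = pts[i][1] - pts[j][1]
                d = math.hypot(dx, dy) or 1e-12  # avoid division by zero
                # Gradient of (d - dist[i][j])**2 with respect to point i.
                coeff = 2.0 * (d - dist[i][j]) / d
                grads[i][0] += coeff * dx
                grads[i][1] += coeff * dy
        for i in range(n):
            pts[i][0] -= lr * grads[i][0]
            pts[i][1] -= lr * grads[i][1]
    return pts


def embedded_distance(points, i, j):
    """Euclidean distance between two embedded points."""
    return math.hypot(points[i][0] - points[j][0], points[i][1] - points[j][1])


# Made-up dissimilarities (1 - similarity) for music, film, computers.
dist = [
    [0.0, 0.3, 0.9],
    [0.3, 0.0, 0.8],
    [0.9, 0.8, 0.0],
]
points = mds_2d(dist)
d_music_film = embedded_distance(points, 0, 1)
d_music_computers = embedded_distance(points, 0, 2)
d_film_computers = embedded_distance(points, 1, 2)
```

With these inputs the music and film points should end up close together and the computers point far from both. Note that the pairwise loop is quadratic in the number of documents, so for a large collection this approach runs into the same scaling concern the question raises.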

Other suggestions

The problem of mapping high-dimensional data to a low-dimensional space while preserving similarity can be addressed with a self-organizing map (SOM, or Kohonen network). I have seen some applications of it to documents.

I don't know of a Python implementation (there might be one), but there is a good one for Matlab (SOM Toolbox).
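For a feel of how a SOM assigns documents to grid cells, here is a small pure-Python sketch. It is a toy under stated assumptions, not the SOM Toolbox algorithm: the names `train_som` and `best_matching_unit`, the linear decay schedules, the 4x4 grid, and the two-dimensional "topic weight" vectors are all made up for illustration.

```python
import math
import random


def best_matching_unit(weights, x):
    """Grid cell whose weight vector is closest to x (squared Euclidean distance)."""
    return min(
        weights,
        key=lambda cell: sum((wk - xk) ** 2 for wk, xk in zip(weights[cell], x)),
    )


def train_som(data, rows=4, cols=4, epochs=200, lr0=0.5, seed=0):
    """Train a rows x cols map; returns a dict mapping (row, col) -> weight vector."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    sigma0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                # learning rate decays to zero
        sigma = sigma0 * (1.0 - frac) + 0.5    # neighbourhood radius shrinks
        for x in data:
            bmu = best_matching_unit(weights, x)
            for cell, w in weights.items():
                grid_d2 = (cell[0] - bmu[0]) ** 2 + (cell[1] - bmu[1]) ** 2
                h = math.exp(-grid_d2 / (2.0 * sigma * sigma))
                # Pull the cell's weights towards x, more strongly near the BMU.
                for k in range(dim):
                    w[k] += lr * h * (x[k] - w[k])
    return weights


# Hypothetical document vectors: (music weight, computers weight).
data = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
som = train_som(data)
cell_music = best_matching_unit(som, [1.0, 0.0])
cell_computers = best_matching_unit(som, [0.0, 1.0])
```

After training, a document's position is simply the grid cell returned by `best_matching_unit`, and documents with similar vectors tend to land in the same or neighbouring cells. In practice the input vectors would be something like the TF-IDF vectors of the documents.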

I think what you're looking for is locality-sensitive hashing. See this answer for a nice, graphical explanation and sample code.
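As a rough illustration of the idea, here is a sketch of SimHash, one locality-sensitive hashing scheme for text. It may differ from what the linked answer shows. The whitespace tokenizer, the use of MD5 as the per-token hash, and the 64-bit fingerprint size are assumptions made for the example.

```python
import hashlib

BITS = 64  # fingerprint size (assumption); uses the first 8 bytes of each digest


def simhash(text):
    """64-bit SimHash fingerprint of a whitespace-tokenised text."""
    counts = [0] * BITS
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(BITS):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


# Made-up sample documents.
doc_a = "the band played a long guitar solo at the rock concert"
doc_b = "the band played a short guitar solo at the rock concert"
doc_c = "the compiler translates source code into machine instructions"

dist_ab = hamming(simhash(doc_a), simhash(doc_b))
dist_ac = hamming(simhash(doc_a), simhash(doc_c))
```

Documents that share most of their tokens tend to get fingerprints that differ in only a few bits, while unrelated documents differ in roughly half of them, so the Hamming distance acts as a cheap similarity proxy. To avoid comparing every pair, fingerprints are usually split into bands and documents that agree on a band are bucketed together; that part is not shown here.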

Licensed under: CC-BY-SA with attribution