Latent semantic indexing

https://stackoverflow.com/questions/1771050

21-09-2019
Question

It is said that the matrices produced by LSI (U, A and V) group together documents that contain synonyms. For example, if we search for "car", we also get documents that contain "automobile". But LSI is nothing more than matrix manipulation. It only takes frequency into account, not semantics. So what is behind this magic that I am missing? Please explain.

Solution

LSI basically creates a frequency profile of each document and looks for documents with similar frequency profiles. If the rest of the frequency profile is sufficiently alike, it will classify two documents as fairly similar, even if one systematically substitutes some words. Conversely, if the frequency profiles are different, it can classify documents as different even if they share frequent use of a few specific terms (e.g., "file" being related to a computer in some cases, and a tool used to cut and smooth metal in other cases).

LSI is also typically used with relatively large groups of documents. The other documents can help in finding similarities as well -- even if document A and B look substantially different, if document C uses quite a few terms from both A and B, it can help in finding that A and B are really fairly similar.
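The A/B/C situation described above can be sketched with NumPy. This is only an illustration under stated assumptions: the five-document corpus, the choice of k = 2, and the helper names to_vector, to_latent and cosine are all made up for this example and are not part of any LSI library.

```python
import numpy as np

# Hypothetical toy corpus: A and B share no words, C uses words from both.
docs = {
    "A": "car wheel",
    "B": "automobile engine",
    "C": "car wheel automobile engine",
    "D": "banana fruit",
    "E": "banana fruit",
}
vocab = sorted({w for text in docs.values() for w in text.split()})
names = list(docs)

def to_vector(text):
    """Raw term-frequency vector over the vocabulary."""
    words = text.split()
    return np.array([words.count(term) for term in vocab], dtype=float)

# Term-document matrix: one row per term, one column per document.
X = np.column_stack([to_vector(docs[n]) for n in names])

# Truncated SVD: keep only the k strongest latent dimensions (k is an assumption).
k = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_k = U[:, :k]

def to_latent(vec):
    """Project a term-frequency vector onto the k latent dimensions."""
    return U_k.T @ vec

def cosine(a, b):
    """Cosine similarity, defined as 0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

# Compare the query "car" with every document, before and after the projection.
query = to_vector("car")
raw_scores = {n: cosine(query, X[:, i]) for i, n in enumerate(names)}
lsi_scores = {n: cosine(to_latent(query), to_latent(X[:, i]))
              for i, n in enumerate(names)}

for n in names:
    print(f"{n}: raw={raw_scores[n]:.2f}  lsi={lsi_scores[n]:.2f}")
```

On this corpus the raw similarity between the query "car" and document B is zero, because they share no terms. After the projection, B lands in the same latent direction as A and C, so it matches the query, while the fruit documents D and E stay unrelated. Nothing here knows what the words mean: C simply ties the two vocabularies into one dominant frequency pattern.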

Other tips

According to the Wikipedia article, "LSI is based on the principle that words that are used in the same contexts tend to have similar meanings." That is, if two words seem to be used interchangeably, they might be synonyms.
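A small sketch of that principle, again with NumPy. The four-document corpus and k = 2 are assumptions chosen so that "car" and "automobile" never appear together but always appear next to "engine" and "wheel"; the variable names are hypothetical.

```python
import numpy as np

# Hypothetical toy corpus: "car" and "automobile" never co-occur,
# but both appear next to "engine" and "wheel".
docs = [
    "car engine wheel",
    "automobile engine wheel",
    "banana fruit",
    "banana fruit",
]
vocab = sorted({w for d in docs for w in d.split()})
row = {term: i for i, term in enumerate(vocab)}

# Term-document count matrix (rows = terms, columns = documents).
X = np.array([[d.split().count(t) for d in docs] for t in vocab], dtype=float)

# Rank-k approximation via truncated SVD (k is an assumption).
k = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Document 0 never mentions "automobile"; compare its weight before and after.
auto_before = X[row["automobile"], 0]
auto_after = X_k[row["automobile"], 0]

# Term vectors in the latent space: rows of U_k scaled by the singular values.
term_vecs = U[:, :k] * s[:k]

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_synonyms = cosine(term_vecs[row["car"]], term_vecs[row["automobile"]])
sim_unrelated = cosine(term_vecs[row["car"]], term_vecs[row["banana"]])

print("weight of 'automobile' in document 0:", auto_before, "->", round(auto_after, 2))
print("car vs automobile:", round(sim_synonyms, 2))
print("car vs banana:", round(sim_unrelated, 2))
```

In the original counts, "automobile" has weight 0 in the first document. In the rank-2 approximation it receives a positive weight there, and the latent vectors for "car" and "automobile" point in the same direction, purely because the two words keep the same company. With a different corpus or a different k the result can look quite different.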

It's not infallible.

Licensed under: CC-BY-SA with attribution
