Question

I am looking for a way to measure the semantic distance between two sentences. Suppose we have the following sentences:

(S1) The beautiful cherry blossoms in Japan. 
(S2) The beautiful Japan.

S2 is created from S1 by removing the words "cherry", "blossoms", and "in". I want to define a function that assigns a high distance to S1 and S2. The reason is that they have significantly different meanings: "beautiful" modifies "cherry blossoms" in S1 but "Japan" in S2.


Solution 2

Research in this area has advanced a great deal, and the distance between the meanings of two sentences can now be computed in several ways, thanks to the development of word vectors and transformers:

  1. Google Universal Sentence Encoder (USE): https://tfhub.dev/google/universal-sentence-encoder/2

  2. InferSent by Facebook: https://github.com/facebookresearch/InferSent

  3. Averaging the word vectors and comparing them with cosine similarity (see the sketch after this list).

  4. spaCy also provides a similarity score between two sentences based on word vectors: https://spacy.io/usage/spacy-101

  5. ELMo: https://github.com/HIT-SCIR/ELMoForManyLangs

  6. BERT: https://github.com/google-research/bert

  7. ALBERT: https://github.com/google-research/ALBERT

  8. RoBERTa: https://ai.facebook.com/blog/roberta-an-optimized-method-for-pretraining-self-supervised-nlp-systems/

  9. XLNet: https://github.com/zihangdai/xlnet

  10. ELECTRA: https://github.com/google-research/electra

etc.
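As a concrete starting point for options 3 and 4 above, here is a minimal sketch using spaCy, whose Doc.similarity is the cosine similarity of averaged token vectors (assuming a model with word vectors, such as en_core_web_md, is installed):

```python
import spacy

# Assumes a model with word vectors is installed, e.g.:
#   python -m spacy download en_core_web_md
nlp = spacy.load("en_core_web_md")

s1 = nlp("The beautiful cherry blossoms in Japan.")
s2 = nlp("The beautiful Japan.")

# Doc.similarity returns the cosine similarity of the averaged
# token vectors, so one possible distance is 1 - similarity.
print("similarity:", s1.similarity(s2))
print("distance:  ", 1 - s1.similarity(s2))
```

Note that averaged word vectors ignore word order and modifier attachment, so this particular method will not capture the difference between your S1 and S2 very well; the transformer-based models above are generally better suited for that.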

Other tips

As Rob pointed out, this is a very hard problem. It requires the program not only to understand linguistic semantics but also to have encyclopedic knowledge. For example, when we say "The beautiful cherry blossoms in Japan", are we talking about a cherry that is beautiful and happens to blossom in Japan, or about a single collective entity, "cherry blossoms", which are beautiful and happen to be in Japan? Resolving this requires a combination of encyclopedic and linguistic knowledge.
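One quick way to see which reading an off-the-shelf parser commits to is to inspect its part-of-speech tags and dependency arcs. This is a hedged sketch with spaCy (the model name is one common choice, and the output depends on the model):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The beautiful cherry blossoms in Japan.")

# If "blossoms" is tagged as a VERB, the parser read "a cherry that
# blossoms"; if it is tagged as a NOUN, it read the collective
# entity "cherry blossoms".
for tok in doc:
    print(f"{tok.text:10} {tok.pos_:6} {tok.dep_:10} head={tok.head.text}")
```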

From a purely encyclopedic perspective, consider the sentences

  1. The beautiful cherry blossoms in Japan.
  2. The beautiful sakura in Japan.
  3. The beautiful flowers in Japan.

The first two are identical in meaning, while the third is closely related but not identical. Establishing sentence distance based on this kind of knowledge is beyond the scope of a purely grammatical analysis and requires external ontologies (e.g. sakura = cherry blossom, and cherry blossom IS_A flower).
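Part of this knowledge is already encoded in WordNet. For instance, the IS_A claim above can be checked by walking the hypernym closure, as in this hedged sketch (the sense selection is deliberately simplified, and "sakura" is, as far as I know, not a WordNet lemma, so that synonymy would need another resource):

```python
from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet")

# First noun sense of "blossom" -- a simplification; real use would
# need word-sense disambiguation.
blossom = wn.synsets("blossom", pos=wn.NOUN)[0]
flower_senses = set(wn.synsets("flower", pos=wn.NOUN))

# IS_A test: does any sense of "flower" coincide with "blossom" or
# appear among its hypernym ancestors?
ancestors = set(blossom.closure(lambda s: s.hypernyms()))
print(bool(flower_senses & (ancestors | {blossom})))
```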

Having said that, there is a little that can be done based on the parse trees of the sentences. For example, if you look at the constituency parse trees of the two sentences you provided, you will be able to break them down into phrases (NP, VP, etc.). For many examples, it will suffice to define the distance between two sentences as the maximum of the distances between their constituent phrases, where the distance between phrases can, in turn, be based on lexical databases such as WordNet or ontologies such as Yago.

For WordNet, a readily available package for measuring semantic distances is the Java-based WS4J. It has an online demo as well. These semantic distances are based on the path distance between two terms in the ontology graph (except Lesk, which simply calculates the overlap of terms in dictionary glosses).
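If you prefer Python to Java, NLTK exposes comparable path-based measures directly on WordNet synsets. A small sketch (the synsets are picked by hand here rather than disambiguated):

```python
from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet")

cherry = wn.synsets("cherry", pos=wn.NOUN)[0]
flower = wn.synsets("flower", pos=wn.NOUN)[0]

# Path similarity: derived from the shortest path between the two
# synsets in the hypernym graph; in (0, 1], higher means closer.
print("path:", cherry.path_similarity(flower))

# Wu-Palmer similarity: based on the depths of the synsets and of
# their least common subsumer.
print("wup: ", cherry.wup_similarity(flower))
```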

This is far, far away from a complete solution to the problem of measuring semantic distance, but I hope it will give you a starting point.

Try out models based on BERT, e.g.,

MoverScore: https://pypi.org/project/moverscore/

which is very good at capturing the semantic similarity of two sentences. Paper reference: https://arxiv.org/abs/1909.02622

You may also want to look into tasks such as "STS" (semantic textual similarity).
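As a concrete example of both points, sentence encoders fine-tuned on STS-style data can be compared directly with cosine similarity. The sketch below uses the sentence-transformers library and the all-MiniLM-L6-v2 model; neither is mentioned above, so treat them as one illustrative choice among the BERT-based options:

```python
from sentence_transformers import SentenceTransformer, util

# Model choice is illustrative; any STS-tuned sentence encoder works.
model = SentenceTransformer("all-MiniLM-L6-v2")

embeddings = model.encode([
    "The beautiful cherry blossoms in Japan.",
    "The beautiful Japan.",
])

# Cosine similarity of the two sentence embeddings; a distance can
# be defined as 1 - similarity.
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
print("similarity:", similarity)
print("distance:  ", 1 - similarity)
```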
