Detecting Similarity in Strings

https://stackoverflow.com/questions/21818128

12-10-2022

Question

If I search for something on Google News, I can click on the "Explore in depth" button and get the same news article from multiple sources. What kind of algorithm is used to compare text articles and determine that they are about the same thing? I have seen this question:

Is there an algorithm that tells the semantic similarity of two phrases

However, I feel that the methods mentioned there would also group together articles that are similar in nature but cover different stories. Is there a standard way of detecting strings that are about the same thing and grouping them, while keeping strings that are merely similar separate? E.g., if I search for "United States Border" I might get several stories about problems at the USA's border, but what would prevent these from all being grouped together? All I can think of is the date of publication, but what if many stories were published very close to each other?

Solution

One standard way to determine the similarity of two articles is to create a language model for each of them and then measure the similarity between the two models.

A language model is usually a probability distribution over tokens (words, bigrams, ..., n-grams), built on the assumption that the article was generated by randomly drawing tokens from that distribution.

The simplest language model is the unigram (single-word) model: P(w|d) = #occurrences(w, d) / |d|, i.e. the number of times the word w appears in the document d, divided by the total number of words in d. Smoothing techniques are often used so that words missing from a document do not end up with zero probability.
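As a rough illustration of the unigram model above, here is a minimal Python sketch. The function name `unigram_model`, the whitespace tokenizer, the fixed `vocabulary` argument, and the choice of add-alpha (Laplace) smoothing are all assumptions made for this example; the answer does not say which smoothing technique to use.

```python
from collections import Counter


def unigram_model(text, vocabulary, alpha=1.0):
    """Estimate P(w|d) for every word in `vocabulary`.

    Uses add-alpha smoothing (an assumption, one of several possible
    smoothing techniques): each word gets `alpha` extra pseudo-counts,
    so words absent from the document still get a non-zero probability.
    """
    # Naive tokenization: lowercase and split on whitespace.
    tokens = text.lower().split()
    counts = Counter(tokens)
    # Only words inside the vocabulary contribute to the normalizer,
    # so the resulting probabilities sum to 1.
    total = sum(counts[w] for w in vocabulary) + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / total for w in vocabulary}


doc = "the border the fence"
vocab = {"the", "border", "fence", "wall"}
model = unigram_model(doc, vocab)

for word in sorted(model):
    print(word, model[word])
```

With `alpha=0` the function falls back to the unsmoothed formula, in which case "wall" would get probability zero.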

Once you have a language model for each article, all that remains is to compare the two models. Common choices are cosine similarity and Jensen-Shannon divergence (where a lower divergence means more similar models).
This gives you an absolute similarity score for the two articles, which can be combined with other signals, such as your suggestion to compare publication dates.
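To make the comparison step concrete, here is a small self-contained Python sketch of both measures applied to unsmoothed unigram models. The helper names (`unigram`, `cosine_similarity`, `js_divergence`), the whitespace tokenizer, and the sample headlines are made up for illustration. The Jensen-Shannon measure is computed as a divergence in bits, so 0 means identical models and 1 means no shared words.

```python
import math
from collections import Counter


def unigram(text):
    """Unsmoothed unigram model: P(w|d) = count(w, d) / |d|."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    return {w: c / len(tokens) for w, c in counts.items()}


def cosine_similarity(p, q):
    """Cosine of the angle between two models viewed as sparse vectors."""
    dot = sum(p[w] * q.get(w, 0.0) for w in p)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (log base 2)."""
    words = set(p) | set(q)
    # Mixture distribution M = (P + Q) / 2.
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in words}

    def kl_to_mixture(a):
        # KL(A || M); m[w] > 0 whenever a[w] > 0, so no smoothing is needed.
        return sum(a[w] * math.log2(a[w] / m[w]) for w in a if a[w] > 0)

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)


a = unigram("border patrol stops truck at border")
b = unigram("truck stopped by border patrol")
c = unigram("senate passes budget bill")

print("a vs b:", cosine_similarity(a, b), js_divergence(a, b))
print("a vs c:", cosine_similarity(a, c), js_divergence(a, c))
```

Scores like these only rank pairs of articles. The cutoff that separates "same story" from "merely similar" is not given here and would have to be tuned, possibly together with other signals such as publication date.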

Licensed under: CC-BY-SA with attribution
