Question

I have a rather specific question, at least for me: after quite a lot of searching I couldn't find anything useful. As the title says, I am looking for an algorithm that decides whether two articles given as input "match", not in the sense of ordinary string matching, but in the sense that they discuss the same topic. I expect the "match" to be a score, computed using some kind of weights and compared against a threshold. The concept is therefore fuzzy: we can't talk about a complete "match", only about a degree of "match".

Sadly, I don't have anything more to go on. I would be really grateful if someone could help me with this topic; theoretical ideas are also welcome.

Thank you.

Solution

There are many ways to measure the 'similarity' of articles. Which one fits depends on what you know about the articles and on the test case you use to evaluate how good your results are.

One simple solution is to use Jaccard similarity on the vocabulary used by the two documents. Pseudocode:

similarity(doc1,doc2):
   set1 <- getWords(doc1)
   set2 <- getWords(doc2)
   intersection <- set_intersection(set1,set2)
   union <- set_union(set1,set2)
   return size(intersection)/size(union)

Note that instead of getWords you can also use bigrams, trigrams, or n-grams in general.
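
As a rough illustration, the pseudocode could be written in Python along these lines. The names get_ngrams and jaccard_similarity, the regular-expression tokenizer, and the choice to return 0.0 for two empty documents are assumptions made for this sketch, not part of the original answer.

```python
import re

def get_ngrams(text, n=1):
    """Return the set of word n-grams in text (lowercased, punctuation dropped)."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard_similarity(doc1, doc2, n=1):
    """Size of the intersection divided by size of the union of the n-gram sets."""
    set1 = get_ngrams(doc1, n)
    set2 = get_ngrams(doc2, n)
    union = set1 | set2
    if not union:
        # Assumption: two documents without any n-grams count as dissimilar.
        return 0.0
    return len(set1 & set2) / len(union)
```

Calling jaccard_similarity(doc1, doc2) compares single words, while n=2 or n=3 compares bigrams or trigrams. The result lies between 0 and 1 and can be checked against whatever threshold works for your data.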


A more complex unsupervised solution is to build a language model from each document and compute the Jensen-Shannon divergence between the two models to judge whether the documents are similar.
A simple language model is P(word|document) = #occurrences(word,document)/size(document)
Usually some smoothing technique is applied to make sure no word has probability 0.
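
A minimal sketch of this idea, assuming unigram models over the combined vocabulary of both documents and add-alpha (Laplace) smoothing as the smoothing technique. The function names, the tokenizer, and the default alpha=1.0 are choices made for this example only.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9']+", text.lower())

def language_model(tokens, vocab, alpha=1.0):
    """P(word|document) with add-alpha smoothing, so no word has probability 0."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; p and q share the same keys."""
    return sum(p[w] * math.log2(p[w] / q[w]) for w in p if p[w] > 0)

def jensen_shannon_divergence(doc1, doc2, alpha=1.0):
    t1, t2 = tokenize(doc1), tokenize(doc2)
    vocab = set(t1) | set(t2)
    if not vocab:
        return 0.0
    p = language_model(t1, vocab, alpha)
    q = language_model(t2, vocab, alpha)
    # m is the average of the two models.
    m = {w: 0.5 * (p[w] + q[w]) for w in vocab}
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

With base-2 logarithms the divergence lies between 0 and 1. Lower values mean more similar documents, so a pair would be called similar when the value falls below a threshold you choose.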


Other solutions use supervised learning algorithms such as SVM. Your features can be derived from the words (tf-idf model, bag-of-words model, ...), and the learner uses them to classify whether doc1 and doc2 are 'similar'. This requires obtaining a 'training set': a set of samples (doc1,doc2) with labels that tell you whether (doc1,doc2) are 'similar' or not. Feed the training data to a learner to build a model, which is later used to classify new pairs of documents.
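
To make the workflow concrete, here is a dependency-free sketch: a small linear SVM trained by subgradient descent on the hinge loss. Everything specific in it is an assumption for illustration: the two pair features (word-set Jaccard overlap and term-frequency cosine) stand in for a full tf-idf representation, the four training pairs are made up, and the hyperparameters are arbitrary. In practice you would more likely use an existing SVM library and a much larger labelled set.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9']+", text.lower())

def pair_features(doc1, doc2):
    """Hypothetical features for a pair: Jaccard overlap and term-frequency cosine."""
    c1, c2 = Counter(tokenize(doc1)), Counter(tokenize(doc2))
    union = set(c1) | set(c2)
    jaccard = len(set(c1) & set(c2)) / len(union) if union else 0.0
    dot = sum(c1[w] * c2[w] for w in c1)
    norm = (math.sqrt(sum(v * v for v in c1.values()))
            * math.sqrt(sum(v * v for v in c2.values())))
    cosine = dot / norm if norm else 0.0
    return [jaccard, cosine]

def train_linear_svm(samples, labels, epochs=200, lr=0.1, reg=0.01):
    """Subgradient descent on the L2-regularised hinge loss; labels are +1 or -1."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # The regularisation term shrinks the weights on every step.
            w = [wi - lr * reg * wi for wi in w]
            if margin < 1:
                # Sample is misclassified or inside the margin: move towards it.
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def predict(model, doc1, doc2):
    """Return True if the trained model classifies the pair as 'similar'."""
    w, b = model
    x = pair_features(doc1, doc2)
    return sum(wi * xi for wi, xi in zip(w, x)) + b > 0

# Made-up training set: (doc1, doc2, label), +1 = similar, -1 = not similar.
training_pairs = [
    ("the cat chased the mouse", "a cat chased a small mouse", 1),
    ("stocks fell on the market today", "the market saw stocks fall today", 1),
    ("cats chase mice", "stocks fell sharply", -1),
    ("rain is expected tomorrow", "the team won the football match", -1),
]
samples = [pair_features(d1, d2) for d1, d2, _ in training_pairs]
labels = [label for _, _, label in training_pairs]
model = train_linear_svm(samples, labels)
```

After training, predict(model, doc1, doc2) classifies a new pair. The quality of the result depends almost entirely on how representative the labelled pairs are, so the toy data above only demonstrates the mechanics.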

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow