Article "Matching" Algorithm

Question

There are many ways to measure the 'similarity' of articles. Which one fits depends on what you know about the articles and on the test case you use to show how good your results are.

One simple solution is to use Jaccard similarity on the vocabulary of the two documents. Pseudocode:

similarity(doc1,doc2):
   set1 <- getWords(doc1)
   set2 <- getWords(doc2)
   intersection <- set_intersection(set1,set2)
   union <- set_union(set1,set2)
   return size(intersection)/size(union)

Note that instead of getWords you can also use bigrams, trigrams, or n-grams in general.
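
A minimal Python sketch of the pseudocode above, with the n-gram variant included. The names ngrams and jaccard_similarity, the whitespace tokenization, and the choice to return 0.0 when both documents are empty are assumptions made for illustration, not part of the original pseudocode.

```python
def ngrams(text, n=1):
    """Return the set of word n-grams (as tuples) found in text."""
    # Assumption: lowercase + whitespace split is a good enough tokenizer here.
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard_similarity(doc1, doc2, n=1):
    """Jaccard similarity of the two documents' n-gram sets."""
    set1, set2 = ngrams(doc1, n), ngrams(doc2, n)
    union = set1 | set2
    if not union:
        return 0.0  # assumed convention when neither document has any n-gram
    return len(set1 & set2) / len(union)
```

For example, jaccard_similarity("a b c", "a b d") gives 0.5 (two shared words out of four distinct ones), and passing n=2 compares bigrams instead of single words.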

A more complex unsupervised solution is to build a language model from each document and compute the Jensen-Shannon divergence between the two models to judge whether the documents are similar.
A simple language model is P(word|document) = #occurrences(word,document)/size(document)
Usually some smoothing technique is applied to make sure no word has probability 0.
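
The language-model approach could be sketched roughly as follows. This assumes unigram models over the combined vocabulary of the two documents, add-alpha (Laplace) smoothing as the smoothing technique, and base-2 logarithms so the divergence falls between 0 and 1. The function names and the alpha parameter are hypothetical.

```python
import math
from collections import Counter


def language_model(text, vocab, alpha=1.0):
    """Unigram P(word | document), add-alpha smoothed over vocab."""
    counts = Counter(text.lower().split())
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def js_divergence(doc1, doc2, alpha=1.0):
    """Jensen-Shannon divergence (base 2) between the two language models."""
    vocab = set(doc1.lower().split()) | set(doc2.lower().split())
    if not vocab:
        return 0.0  # assumed convention for two empty documents
    p = language_model(doc1, vocab, alpha)
    q = language_model(doc2, vocab, alpha)
    # m is the average distribution; JSD is the mean KL divergence to m
    m = {w: 0.5 * (p[w] + q[w]) for w in vocab}

    def kl(a):
        return sum(a[w] * math.log2(a[w] / m[w]) for w in vocab)

    return 0.5 * kl(p) + 0.5 * kl(q)
```

A lower value means more similar documents: identical documents give 0, and the value grows as the word distributions drift apart. Turning it into a yes/no decision still needs a threshold that you would have to pick from your own data.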

Other solutions use supervised learning algorithms such as SVM. Your features can be the words (tf-idf model, bag-of-words model, ...), and you use them to classify whether doc1 and doc2 are 'similar'. This requires a 'training set': a set of samples (doc1,doc2) with labels that tell you whether each pair is 'similar' or not. Feed the training data to a learner to build a model, which is later used to classify new pairs of documents.
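
To make the train-then-classify workflow concrete, here is a deliberately small, dependency-free sketch. It is not a full SVM: it trains a linear classifier by subgradient descent on the hinge loss (the loss a linear SVM minimizes) and leaves out the regularization term. The single word-overlap feature stands in for richer tf-idf or bag-of-words features, the training pairs are made-up toy data, and all names are hypothetical.

```python
def pair_features(doc1, doc2):
    """Feature vector for a document pair: [word overlap (Jaccard), bias]."""
    set1, set2 = set(doc1.lower().split()), set(doc2.lower().split())
    union = set1 | set2
    overlap = len(set1 & set2) / len(union) if union else 0.0
    return [overlap, 1.0]


def train(pairs, labels, lr=0.5, epochs=20):
    """Hinge-loss subgradient descent; labels are +1 (similar) or -1."""
    xs = [pair_features(a, b) for a, b in pairs]
    w = [0.0] * len(xs[0])
    for _ in range(epochs):
        for x, y in zip(xs, labels):
            score = sum(wi * xi for wi, xi in zip(w, x))
            if y * score < 1:  # misclassified or inside the margin
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
    return w


def is_similar(w, doc1, doc2):
    """Classify a new pair with the learned weights."""
    x = pair_features(doc1, doc2)
    return sum(wi * xi for wi, xi in zip(w, x)) > 0


# Toy, made-up training set for illustration only
train_pairs = [
    ("the cat sat on the mat", "the cat sat on a mat"),
    ("stock prices fell sharply today", "stock prices fell today"),
    ("the cat sat on the mat", "stock prices fell sharply today"),
    ("rain is expected tomorrow", "the team won the match"),
]
train_labels = [1, 1, -1, -1]
model = train(train_pairs, train_labels)
```

With one feature the learner effectively finds a threshold on the overlap score. In practice you would use many features per pair and an established SVM implementation, with a much larger labeled set.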