Question

I want to determine the similarity of the content of two news items, similar to Google News but different in that I want to be able to determine what the basic topics are, and then which topics are related.

So if an article was about Saddam Hussein, then the algorithm might recommend something about Donald Rumsfeld's business dealings in Iraq.

If you can just throw around keywords like k-nearest neighbours, with a little explanation of why they work (if you can), I will do the rest of the research and tweak the algorithm. I'm just looking for a place to get started, since I know someone out there must have tried something similar before.


Solution

First thoughts:

  • toss away noise words (and, you, is, the, some, ...).
  • count all the other words and sort by quantity.
  • for each word that appears in both articles, add a score based on the sum (or product, or some other formula) of the two quantities.
  • the combined score represents the similarity.
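The steps above can be sketched roughly as follows. The noise-word list and the product-of-counts scoring formula here are just illustrative choices; a real stop-word list would be much larger.

```python
import re
from collections import Counter

# A tiny noise-word list for illustration; a real one would be much larger.
NOISE_WORDS = {"and", "you", "is", "the", "some", "a", "an", "of", "in", "to"}

def word_counts(text):
    """Count the non-noise words in an article."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in NOISE_WORDS)

def similarity(text_a, text_b):
    """Score two articles by summing the products of their shared-word counts."""
    counts_a, counts_b = word_counts(text_a), word_counts(text_b)
    shared = counts_a.keys() & counts_b.keys()
    return sum(counts_a[w] * counts_b[w] for w in shared)
```

With this scoring, two articles that repeatedly mention the same names score high, and articles with no vocabulary overlap score zero.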

It seems to me that an article primarily about Donald Rumsfeld would contain those two words quite often, which is why they carry weight in the score.

However, one article might mention Warren Buffett many times and Bill Gates only once, while another mentions both Bill Gates and Microsoft many times. The correlation between those two would be minimal.

Based on your comment:

So if an article was about Saddam Hussein, then the algorithm might recommend something about Donald Rumsfeld's business dealings in Iraq.

that wouldn't be the case unless the Saddam article also mentioned Iraq (or Donald).

That's where I'd start, and I can already see potential holes in the theory (an article about Bill Gates would match closely with an article about Bill Clinton if both mention "Bill" a lot). This may well be taken care of by all the other words (Microsoft for one Bill, Hillary for the other).

I'd perhaps give it a test run before trying to introduce word-proximity functionality, since that's going to make it very complicated (maybe unnecessarily so).

One other possible improvement would be maintaining 'hard' associations (like always adding the word Afghanistan to articles with Osama bin Laden in them). But again, that requires extra maintenance for possibly dubious value since articles about Osama would almost certainly mention Afghanistan as well.

OTHER TIPS

At the moment I am thinking of something like this.

Each non-noise word is a dimension. Each article is represented by a vector: words that don't appear get zero, and words that do appear get a value equal to the number of times they appear divided by the total number of words on the page. Then I can take the Euclidean distance between any two points in this space to measure how similar (or rather, how different) any two articles are.
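A minimal sketch of that term-frequency vector and distance computation might look like this (the tiny noise-word set is an illustrative placeholder; absent words are treated as zero-valued dimensions rather than stored explicitly):

```python
import math
from collections import Counter

# Illustrative noise-word set; a real stop-word list would be larger.
NOISE = {"and", "you", "is", "the", "some", "a", "an", "of", "in", "to"}

def tf_vector(text):
    """Term-frequency vector: each word's count divided by the total word count."""
    words = [w for w in text.lower().split() if w not in NOISE]
    counts = Counter(words)
    total = len(words)
    return {w: c / total for w, c in counts.items()}

def euclidean_distance(vec_a, vec_b):
    """Distance in word-space; words missing from a vector contribute zero."""
    dims = set(vec_a) | set(vec_b)
    return math.sqrt(sum((vec_a.get(d, 0.0) - vec_b.get(d, 0.0)) ** 2
                         for d in dims))
```

Note that a smaller distance means more similar articles, so a recommender would sort candidate articles by ascending distance.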

The next step would be to determine clusters of articles, and then a central point for each cluster. The Euclidean distance between any two cluster centres then gives the similarity of the topics.
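Assuming clusters have already been found somehow (k-means over the vectors would be one option), the centroid and topic-distance step could be sketched like this. The cluster groupings in the usage below are hypothetical, just to show the shapes involved:

```python
import math

def centroid(vectors):
    """Central point of a cluster: the per-dimension mean of its article vectors."""
    dims = set().union(*vectors)
    return {d: sum(v.get(d, 0.0) for v in vectors) / len(vectors) for d in dims}

def topic_distance(cluster_a, cluster_b):
    """Euclidean distance between two cluster centroids; smaller = more related."""
    ca, cb = centroid(cluster_a), centroid(cluster_b)
    dims = set(ca) | set(cb)
    return math.sqrt(sum((ca.get(d, 0.0) - cb.get(d, 0.0)) ** 2 for d in dims))
```

So two clusters whose articles share a lot of vocabulary (say, an Iraq cluster and a Rumsfeld cluster) would sit close together, which is exactly the "related topics" signal the question asks for.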

Baaah, I think by typing it out I solved my own problem. Of course only in a very high-level way; I am sure when I get down to it I will find problems ... the devil is always in the details.

But comments and improvements still highly appreciated.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow