Question

We have data from several news sites, amounting to quite literally millions of entries. Because each news site publishes its own version of a story (and a single site may publish several different versions of the same story), many of our entries are variants of a single news item. I am currently working on separating the "unique" news from our repository. That means that if a single news item has several variants, only one variant (most likely the one reported earliest) will be kept.

I believe clustering the news articles can be used to group similar news together. I am currently exploring DBSCAN and hierarchical clustering (Ward's method). Am I moving in the right direction? Is clustering the best solution to our problem? If so, which other algorithms and techniques should I explore?
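To make the idea concrete, here is a minimal sketch of the DBSCAN route: cluster the articles by text similarity, then keep the earliest entry of each cluster. Everything specific in it is an assumption made for illustration and not part of our actual pipeline: the `(published_at, text)` tuple layout, the bag-of-words cosine distance, the helper names, the `eps=0.4` threshold and the four toy headlines. The DBSCAN here is a plain O(n²) version written out by hand so that the example is self-contained; at millions of entries a library implementation over sparse TF-IDF vectors would be the more realistic choice.

```python
import math
import re
from collections import Counter, deque


def vectorize(text):
    """Bag-of-words term counts for one article (lowercased word tokens)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_distance(a, b):
    """1 - cosine similarity between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(
        sum(c * c for c in b.values())
    )
    return 1.0 - dot / norm if norm else 1.0


def dbscan(vectors, eps, min_samples):
    """Plain O(n^2) DBSCAN; returns one label per vector, -1 meaning noise."""
    n = len(vectors)
    # A point counts as its own neighbour (distance 0 <= eps).
    neighbours = [
        [j for j in range(n) if cosine_distance(vectors[i], vectors[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = deque(neighbours[i])
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # j is a core point, keep expanding
    return labels


def earliest_per_story(articles, eps=0.4, min_samples=1):
    """articles: list of (published_at, text) tuples (assumed layout).

    Returns (indices of the earliest article per cluster, cluster labels).
    Timestamps are assumed to be ISO-8601 strings, which sort chronologically.
    """
    vectors = [vectorize(text) for _, text in articles]
    labels = dbscan(vectors, eps, min_samples)
    earliest = {}
    for idx, label in enumerate(labels):
        # A noise point has no variants, so it is treated as its own story.
        key = label if label != -1 else ("noise", idx)
        if key not in earliest or articles[idx][0] < articles[earliest[key]][0]:
            earliest[key] = idx
    return sorted(earliest.values()), labels


# Toy data, invented for this example.
articles = [
    ("2024-01-02T09:00", "Central bank raises interest rates by half a point"),
    ("2024-01-02T08:30", "Central bank raises interest rates by half a percentage point"),
    ("2024-01-02T10:15", "Storm floods coastal towns overnight"),
    ("2024-01-02T11:00", "Overnight storm floods several coastal towns"),
]

kept, labels = earliest_per_story(articles)
print(labels)  # [0, 0, 1, 1]
print(kept)    # [1, 2]
```

With `min_samples=1` every article is a core point, so nothing is labelled as noise and an article without variants simply forms a cluster of its own. That suits this use case, where a story reported only once still has to be kept. The `eps` threshold is the part that would need tuning against real data.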

No correct solution

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange