Question

I am new to clustering and have only implemented a couple of algorithms before. I need to cluster tweets according to their similarity. One way is to use only hashtags, but I don't think that would be informative enough, so the complete tweets should be analyzed.

I have also been searching the web for algorithms for clustering feeds.

One I encountered is TF-IDF. I want to know whether there are better algorithms than TF-IDF that can be implemented in a few hours. I would also be interested in some informative sources on the clustering of Twitter feeds.

PS: Number of tweets: 10^5


Solution

As Anony Mousse pointed out in his comment above, TF-IDF is only a normalization measure, not a clustering algorithm in itself: it makes sure that words which are overly popular across all documents don't gain too much importance.

For data preparation, I'd recommend reading this and the second part of it too (linked via the above link), if you haven't already done so. The important step is to get a vector of numbers from each tweet. In machine learning in general, you need a feature vector because that is what lets you apply mathematical algorithms to your data.
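To make the "vector of numbers" idea concrete, here is a minimal sketch of turning tweets into TF-IDF vectors in plain Python. The function names, the tokenizer regex, and the smoothed IDF formula are my own assumptions for illustration, not something taken from the linked articles. It builds dense vectors, which is fine for a small sample; for the full 10^5 tweets you would want a sparse representation.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase a tweet and split it into word, hashtag, and mention tokens."""
    return re.findall(r"[#@]?\w+", text.lower())

def tfidf_vectors(tweets):
    """Return (vocab, vectors) with one L2-normalized TF-IDF vector per tweet."""
    docs = [Counter(tokenize(t)) for t in tweets]
    vocab = sorted({term for doc in docs for term in doc})
    n = len(docs)
    # Document frequency: the number of tweets that contain each term.
    df = Counter(term for doc in docs for term in doc)
    # Smoothed IDF (an assumed variant): a term found in every tweet gets 1, not 0.
    idf = {term: math.log((1 + n) / (1 + df[term])) + 1 for term in vocab}
    vectors = []
    for doc in docs:
        vec = [doc[term] * idf[term] for term in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0  # guard against empty tweets
        vectors.append([x / norm for x in vec])
    return vocab, vectors

vocab, vectors = tfidf_vectors(["I love #python", "I love coffee", "coffee time"])
```

Each row of `vectors` lines up with `vocab`, so a term that is rare across the collection (here `#python`) ends up with a larger weight than one that shows up in many tweets.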

Now that you have a feature vector for each tweet in your collection, things get a bit simpler. Two clustering algorithms come to mind that you can whip up in a couple of hours each, with extensive testing taking maybe a weekend.

  • K-Means Clustering
  • Hierarchical Clustering With Single Linkage

With only 100,000 tweets, you should be able to run these algorithms on a single computer (this is not big data, so there is no need for cluster computing), using your favorite language (C++, Java, Python, MATLAB, etc.). Personally, I think K-Means Clustering is easier to implement than Hierarchical Clustering (I have done both before).
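As a rough idea of how little code K-Means needs, here is a sketch of the standard Lloyd iteration in plain Python. The function names, the random initialization, and the fixed iteration cap are assumptions of mine; it expects a list of equal-length numeric feature vectors, one per tweet. If the vectors are length-normalized, Euclidean distance ranks neighbors the same way cosine similarity does.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def kmeans(vectors, k, iterations=100, seed=0):
    """Lloyd's algorithm. Returns (labels, centroids)."""
    rng = random.Random(seed)
    # Initialize the centroids with k distinct vectors picked at random.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean(members)
    return labels, centroids

points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
labels, centroids = kmeans(points, k=2)
```

You have to choose `k` yourself, and the result depends on the initial centroids, so it is worth running it with a few different seeds and comparing the clusterings.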

EDIT: Please follow the comments below only if you have labeled training data, i.e. you have tweets with, say, labeled sentiments (happy-user, ok-ok, bad product, angry-user, abusive-user) and the question you want to answer is: given a new tweet, what is its sentiment?

Here is one very good resource you should look at, to get a better understanding of K-Nearest Neighbors:

In general, for the other two algorithms, there are ample resources, and the Wikipedia articles are the best place to start. Personally, I feel K-Nearest Neighbors (k-NN for short) is the easiest of the three to implement and will give you quick results.
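To show why I call k-NN the easiest, here is a minimal sketch of a brute-force k-NN classifier in plain Python (3.8+ for `math.dist`). The function name, the default `k=3`, and the toy vectors are illustrative assumptions; in practice the vectors would be the feature vectors of your labeled tweets.

```python
import math
from collections import Counter

def knn_predict(train_vectors, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training vectors."""
    # Brute force: rank every training vector by Euclidean distance to the query.
    nearest = sorted(
        range(len(train_vectors)),
        key=lambda i: math.dist(train_vectors[i], query),
    )[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy labeled data: two well-separated groups of feature vectors.
train = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
         [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
sentiments = ["happy-user"] * 3 + ["angry-user"] * 3
prediction = knn_predict(train, sentiments, [0.5, 0.5])
```

There is no training step at all: the whole cost is in comparing the new tweet against every labeled tweet, which is still manageable at 10^5 tweets.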

Licensed under: CC-BY-SA with attribution