Clustering of Twitter Feeds

Question

As Anony Mousse pointed out in his comment above, TF/IDF is only a normalization measure: it makes sure that words that are overly common across all documents don't gain too much importance.

For data preparation, I'd recommend reading this and the second part of it too (linked via the above link), if you haven't already done so. The key step is to turn each tweet into a vector of numbers. In machine learning generally, you need such a feature vector because it is what lets you apply mathematical algorithms to your data.
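To make the idea of a feature vector concrete, here is a minimal sketch of TF/IDF weighting in plain Python. The function name `tfidf_vectors`, the whitespace tokenization, and the sample tweets are all my own assumptions for illustration; real tweets need more careful cleaning (hashtags, URLs, mentions), and other TF/IDF variants exist.

```python
import math
from collections import Counter

def tfidf_vectors(tweets):
    """Return (vocabulary, vectors): one TF/IDF weighted vector per tweet."""
    # Naive tokenization (an assumption): lowercase and split on whitespace.
    docs = [tweet.lower().split() for tweet in tweets]
    vocabulary = sorted({word for doc in docs for word in doc})
    n_docs = len(docs)
    # Document frequency: the number of tweets that contain each word.
    doc_freq = Counter(word for doc in docs for word in set(doc))
    # A word that occurs in every tweet gets idf = log(1) = 0.
    idf = {word: math.log(n_docs / doc_freq[word]) for word in vocabulary}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # guard against empty tweets
        vectors.append([counts[word] / total * idf[word] for word in vocabulary])
    return vocabulary, vectors

# Hypothetical sample data.
tweets = ["the phone is great", "the phone is broken", "the weather is great"]
vocabulary, vectors = tfidf_vectors(tweets)
```

Words such as "the" and "is" appear in every sample tweet, so their weight drops to zero, which is exactly the normalization effect described above.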

Now that you have a feature vector for each tweet in your collection, things get a bit simpler. Two clustering algorithms come to mind that you can each whip up in a couple of hours, with extensive testing taking maybe a weekend:

K-Means Clustering
Hierarchical Clustering With Single Linkage
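
As a rough idea of how little code K-Means needs, here is a minimal sketch in plain Python. It assumes Euclidean distance, random initial centroids picked from the data, and a hypothetical `kmeans` function name; the two-dimensional points stand in for real tweet feature vectors.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iterations=100, seed=0):
    """Return (assignments, centroids) after clustering vectors into k groups."""
    rng = random.Random(seed)
    # Initialization (an assumption): pick k distinct data points as centroids.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assignments = None
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_assignments == assignments:
            break  # nothing changed, so the algorithm has converged
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids

# Hypothetical toy data: two well-separated groups.
points = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]]
labels, centroids = kmeans(points, k=2)
```

The result depends on the initial centroids, so in practice it is common to run the algorithm several times with different seeds and keep the best run.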

With only 100,000 tweets, you should be able to run these algorithms on a single computer (i.e. this is not big data, so there is no need for cluster computing), using your favorite language (C++, Java, Python, MATLAB, etc.). Personally, I find K-Means Clustering easier to implement than Hierarchical Clustering (I have implemented both before).
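
For comparison, hierarchical clustering with single linkage can be sketched as follows. This is a deliberately naive version under my own assumptions (Euclidean distance, a hypothetical `single_linkage` function, stopping at a fixed number of clusters). It recomputes distances on every merge, so it is only meant for small inputs; a real run over the full collection would need a more efficient formulation.

```python
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def single_linkage(vectors, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    Returns a list of clusters, each a list of indices into vectors.
    """
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > n_clusters:
        best = None  # (distance, index_a, index_b)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: cluster distance is the closest pair of members.
                d = min(euclidean(vectors[i], vectors[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])
        del clusters[b]  # b > a, so index a stays valid
    return clusters

# Hypothetical toy data: two well-separated groups.
points = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]]
clusters = single_linkage(points, n_clusters=2)
```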

EDIT: Please follow the comments below only if you have labeled training data, i.e. you have tweets with labeled sentiments, say (happy-user, ok-ok, bad product, angry-user, abusive-user), and the question you want to answer is: given a new tweet, what is its sentiment?

Here is one very good resource you should look at, to get a better understanding of K-Nearest Neighbors:

Laszlo Kozma's Slides

In general, for the other two algorithms, there are ample resources, and the Wikipedia articles are the best place to start. Personally, I feel K-Nearest Neighbors (k-NN for short) is the easiest of the three to implement and will give you quick results.
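
To show why k-NN is quick to implement, here is a minimal sketch in plain Python. The function name `knn_predict`, the Euclidean distance, the majority vote, and the toy two-dimensional vectors are my own assumptions; with real data, the vectors would be the feature vectors of your labeled tweets.

```python
from collections import Counter

def knn_predict(train_vectors, train_labels, query, k=3):
    """Predict a label for query by majority vote among its k nearest neighbors."""
    # Rank training examples by squared Euclidean distance to the query.
    ranked = sorted(
        range(len(train_vectors)),
        key=lambda i: sum((x - y) ** 2 for x, y in zip(train_vectors[i], query)),
    )
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical labeled data, using two of the sentiment labels mentioned above.
train_vectors = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3],
                 [1.0, 0.9], [0.9, 1.1], [1.2, 1.0]]
train_labels = ["happy-user", "happy-user", "happy-user",
                "angry-user", "angry-user", "angry-user"]
prediction = knn_predict(train_vectors, train_labels, [0.95, 1.0], k=3)
```

There is no training step at all: the labeled tweets are the model, and all the work happens at prediction time.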