Question

I've used Apache Flume to pipe a large number of tweets into Hadoop's HDFS. I was trying to do sentiment analysis on this data - just something simple to begin with, like a positive vs. negative word comparison.

My problem is that all the guides I can find showing how to do it assume a text file of positive and negative words and then a single huge text file containing every tweet.

As I used Flume, all my data is already in Hadoop. When I access it using localhost:50070 I can see the data, in separate files according to month/day/hour, with each file containing three or four tweets. I have maybe 50 of these files for every hour. Although it doesn't say anywhere, I'm assuming they are in JSON format.

Bearing this in mind, how can I perform my analysis on them? In every example I've seen where the Mapper and Reducer have been written, the job has been run over a single file, not a large collection of small JSON files. What should my next step be?


Solution

This example should get you started: https://github.com/cloudera/cdh-twitter-example

Basically, use a Hive external table to map your JSON data and query it with HiveQL.
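
A minimal sketch of what that looks like, assuming you build the hive-serdes jar from that repo (the jar name, SerDe class, table and column names, and HDFS path below are illustrative and will need adjusting to your Flume sink's configuration):

    ADD JAR hive-serdes-1.0-SNAPSHOT.jar;   -- JSON SerDe built by the repo's hive-serdes module

    CREATE EXTERNAL TABLE tweets (
      id BIGINT,
      created_at STRING,
      text STRING
    )
    PARTITIONED BY (datehour INT)           -- one partition per hour directory written by Flume
    ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
    LOCATION '/user/flume/tweets';

The SerDe only maps the JSON attributes you declare as columns, so you can start with just the fields you need (for a word comparison, essentially the text field) and ignore the rest of each tweet.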

OTHER TIPS

When you want to process all the files in a directory, you can just specify the directory's path as the input to your Hadoop job, and it will take every file in that directory as its input.

For example, if your small files are under the directory /user/flume/tweets/...., then in your Hadoop job you can just specify /user/flume/tweets/ as your input path.
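
For a hand-written MapReduce job that simply means passing the directory to FileInputFormat as the input path. With the Hive external table sketched above, the same idea appears as the LOCATION of each partition: register an hour directory as a partition and Hive reads every small file under it. A rough sketch, with an illustrative date and a path matching Flume's year/month/day/hour layout:

    ALTER TABLE tweets ADD IF NOT EXISTS PARTITION (datehour = 2015030913)
      LOCATION '/user/flume/tweets/2015/03/09/13';

    -- every file under the registered directories is scanned
    SELECT text FROM tweets WHERE datehour = 2015030913 LIMIT 10;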

If you want to automate the analysis to run every hour, you will need to write an Oozie workflow.

You can refer to the link below for sentiment analysis in Hive:

https://acadgild.com/blog/sentiment-analysis-on-tweets-with-apache-hive-using-afinn-dictionary/
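
In the spirit of that approach, here is a rough HiveQL sketch of an AFINN-style score, reusing the illustrative tweets table from the solution above. It assumes you have copied AFINN-111.txt (tab-separated word/score pairs) into an HDFS directory of your choosing; the dictionary table name and path are placeholders.

    CREATE EXTERNAL TABLE dictionary (word STRING, score INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/user/flume/afinn';           -- directory holding AFINN-111.txt

    -- split each tweet into words, look each word up, and sum the scores per tweet
    SELECT s.id, SUM(COALESCE(d.score, 0)) AS sentiment
    FROM (
      SELECT t.id, w.word
      FROM tweets t
      LATERAL VIEW explode(split(lower(t.text), '\\s+')) w AS word
    ) s
    LEFT OUTER JOIN dictionary d ON (s.word = d.word)
    GROUP BY s.id;

A positive sum roughly means "more positive than negative words", which is the simple comparison you described; tweets whose words never match the dictionary just score 0.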

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow