Flume not writing to HDFS unless killed

Question 1

How much data is being written? I bet it's not writing because you haven't collected enough to trigger a flush to HDFS with the default configuration parameters. There are a number of ways to configure the HDFS sink so that it flushes in a predictable way. You can set it to roll the file after a number of events (hdfs.rollCount), after a time interval in seconds (hdfs.rollInterval), or once the file reaches a size in bytes (hdfs.rollSize). When you kill the agent, it cleans up whatever it is currently doing and flushes, so you are effectively forcing the write by killing it.
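As a rough sketch of the three roll triggers, the fragment below rolls on time only. The agent and sink names (TwitterAgent, HDFS) are borrowed from the configuration posted later in this thread, and the 300-second value is only an illustrative assumption, not a recommendation. Setting a trigger to 0 disables it.

```
# Illustrative values only; tune them for your own data volume.
# Roll the current file every 300 seconds (assumed value)
TwitterAgent.sinks.HDFS.hdfs.rollInterval = 300
# 0 = never roll based on file size
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
# 0 = never roll based on number of events
TwitterAgent.sinks.HDFS.hdfs.rollCount = 0
```

With only one trigger active, it is easier to reason about when a file will be closed and become visible under its final name.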

You can also try lowering hdfs.batchSize, which is the number of events written to a file before it is flushed to HDFS.

Remember that Hadoop likes larger files. In general, you should try to avoid lots of small files, so be careful not to roll too often.
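One way to favor larger files is to roll on size alone. The sketch below assumes a 128 MB HDFS block size, which may not match your cluster, and reuses the TwitterAgent and HDFS names from the configuration posted later in this thread. Treat the numbers as placeholders.

```
# Assumption: HDFS block size is 128 MB (134217728 bytes)
# Roll only when the file reaches roughly one block
TwitterAgent.sinks.HDFS.hdfs.rollSize = 134217728
# Disable count-based rolling
TwitterAgent.sinks.HDFS.hdfs.rollCount = 0
# Disable time-based rolling
TwitterAgent.sinks.HDFS.hdfs.rollInterval = 0
```

The trade-off is that on a low-volume source a file may stay open for a long time before it reaches that size, so a purely size-based setup is not always the right choice.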

When you run it in the foreground as you are doing, Ctrl+C or kill are the only real ways to stop it. In production you should probably use the init scripts, which provide start/stop/restart.

Question 2

Thank you, Donald and Praveen.

I was able to solve the problem by setting the following in my flume-conf file:

TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000

and by deleting this entry:

TwitterAgent.sinks.HDFS.hdfs.rollInterval = 600

Now Flume is writing to HDFS on the go.