Question

I followed the link for setting up TwitterSource and the HDFS sink.

Command used for starting the agent:

bin/flume-ng agent -n TwitterAgent -c conf -f conf/flume-conf.properties -Dflume.root.logger=DEBUG,console

I was successful in doing that, but there is one problem now: the output doesn't get written until I stop the Flume agent. It is written to HDFS only once I kill the agent.

I have two questions here:

1) How do I stop a Flume agent? Is there any way other than pressing CTRL+C?

2) Can I make the Flume agent write to HDFS on the go, without having to kill the agent?

Note: After killing the process, the file that is written has a .tmp extension. Is that expected?

EDIT:

TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
TwitterAgent.sinks.HDFS.hdfs.rollInterval = 600

Solution

How much data is getting written? I bet it's not writing because you haven't collected enough to trigger a flush to HDFS with the default configuration parameters. There are a number of ways to configure the HDFS sink so that it writes out and rolls its files in a predictable way. You can have it roll after a number of events (hdfs.rollCount), after an interval in seconds (hdfs.rollInterval), or once the file reaches a size in bytes (hdfs.rollSize). What is happening is that when you kill the agent, it cleans up what it is currently doing and flushes, so you are effectively forcing the write by killing it.

You can also try lowering hdfs.batchSize, which is the number of events written to the file before it is flushed to HDFS.

Remember that Hadoop likes larger files. In general, you should try to avoid lots of small files, so be careful not to roll too often.
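
As a rough illustration of the settings discussed above, a sink configuration might look like the following. This is only a sketch: the agent and sink names (TwitterAgent, HDFS) are taken from the question, and every value is an illustrative assumption rather than a tuned recommendation. Setting a roll trigger to 0 disables it, and the file rolls on whichever remaining trigger fires first.

```properties
# Sketch only: values are illustrative assumptions, not recommendations.
# Agent and sink names (TwitterAgent, HDFS) come from the question.

# Number of events written to the file before it is flushed to HDFS.
TwitterAgent.sinks.HDFS.hdfs.batchSize = 100

# Roll triggers: the current file is closed (losing its .tmp suffix)
# when the first enabled trigger fires. 0 disables a trigger.
# rollSize is in bytes, rollCount in events, rollInterval in seconds.
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
TwitterAgent.sinks.HDFS.hdfs.rollInterval = 300

# Optionally close a file that has received no events for this many seconds.
TwitterAgent.sinks.HDFS.hdfs.idleTimeout = 60
```

With low-volume sources, a time-based trigger such as hdfs.rollInterval or hdfs.idleTimeout is usually what ensures files get closed, while the count and size triggers bound file size under heavier load. Larger values produce fewer, bigger files at the cost of a longer wait before a file is finalized.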


When you run it in the foreground as you are, Ctrl+C or kill are the only real ways to stop it. In production you should probably use the init scripts, which provide start/stop/restart.

OTHER TIPS

Thank you, Donald and Praveen.

I solved the problem by setting the following in my flume-conf file:

TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000

and by deleting this entry:

TwitterAgent.sinks.HDFS.hdfs.rollInterval = 600

Now Flume is writing to HDFS on the go.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow