Question

I have 20-25 agents sending data to a couple of collector agents, and these collector agents then have to write it to HDFS.

Where should these collector agents run? On the DataNodes of the Hadoop cluster, or outside the cluster? What are the pros and cons of each, and how are people currently running them?


Solution

Tier-2 Flume agents use the HDFS sink to write directly to HDFS. What's more, tier 1 can use a failover sink group, in case one of the tier-2 Flume agents goes down.
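
A minimal sketch of that layout in Flume's properties-file configuration; the agent names, collector hostnames, ports, and HDFS path below are placeholders, not part of the original answer:

    # Tier-1 agent: two Avro sinks in a failover group, so events
    # reroute if the preferred collector goes down.
    # (Source and channel definitions omitted; see the sketches below.)
    tier1.sinks = c1 c2
    tier1.sinkgroups = g1
    tier1.sinkgroups.g1.sinks = c1 c2
    tier1.sinkgroups.g1.processor.type = failover
    tier1.sinkgroups.g1.processor.priority.c1 = 10    # preferred collector
    tier1.sinkgroups.g1.processor.priority.c2 = 5     # fallback collector
    tier1.sinkgroups.g1.processor.maxpenalty = 10000  # ms before retrying a failed sink

    tier1.sinks.c1.type = avro
    tier1.sinks.c1.channel = ch1
    tier1.sinks.c1.hostname = collector1.example.com
    tier1.sinks.c1.port = 4545

    tier1.sinks.c2.type = avro
    tier1.sinks.c2.channel = ch1
    tier1.sinks.c2.hostname = collector2.example.com
    tier1.sinks.c2.port = 4545

    # Tier-2 collector agent: receive Avro events and write them to HDFS.
    collector.sources = r1
    collector.channels = ch1
    collector.sinks = k1

    collector.sources.r1.type = avro
    collector.sources.r1.channels = ch1
    collector.sources.r1.bind = 0.0.0.0
    collector.sources.r1.port = 4545

    collector.channels.ch1.type = file

    collector.sinks.k1.type = hdfs
    collector.sinks.k1.channel = ch1
    collector.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/events/%Y-%m-%d
    collector.sinks.k1.hdfs.fileType = DataStream
    collector.sinks.k1.hdfs.useLocalTimeStamp = true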

OTHER TIPS

I assume you're using something like Flume. If that's the case, the Flume agent (at least the first tier) runs wherever the data is being sourced from, e.g. on the web server for web logs.
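
For the web-log case, a hedged sketch of such a first-tier agent running on the web server itself, tailing the access log with an exec source and forwarding over Avro; the log path and collector address are assumptions:

    # Tier-1 agent co-located with the web server.
    tier1.sources = weblog
    tier1.channels = ch1
    tier1.sinks = c1

    # Tail the access log where it is produced.
    tier1.sources.weblog.type = exec
    tier1.sources.weblog.command = tail -F /var/log/httpd/access_log
    tier1.sources.weblog.channels = ch1

    tier1.channels.ch1.type = memory
    tier1.channels.ch1.capacity = 10000

    # Ship events to a collector over Avro RPC.
    tier1.sinks.c1.type = avro
    tier1.sinks.c1.channel = ch1
    tier1.sinks.c1.hostname = collector1.example.com
    tier1.sinks.c1.port = 4545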

Flume does support other protocols, like JMS, so the location will vary in those scenarios.
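
In the JMS scenario, the agent would sit wherever it can reach the broker; here is a sketch using Flume's JMS source against a hypothetical ActiveMQ broker (all connection details below are assumptions):

    # Tier-1 agent consuming from a JMS queue instead of a local log file.
    tier1.sources.jms1.type = jms
    tier1.sources.jms1.channels = ch1
    tier1.sources.jms1.initialContextFactory = org.apache.activemq.jndi.ActiveMQInitialContextFactory
    tier1.sources.jms1.connectionFactory = ConnectionFactory
    tier1.sources.jms1.providerURL = tcp://broker.example.com:61616
    tier1.sources.jms1.destinationName = EVENTS
    tier1.sources.jms1.destinationType = QUEUE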

For production clusters, you don't want to run "agents" like Flume on DataNodes. It's best to leave that hardware's resources for the cluster.

If you have a lot of agents, you'll want to use a tiered architecture to consolidate and funnel the numerous sources into a smaller set of agents that will write to HDFS. This helps control visibility and exposure of the cluster to external servers.
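
The fan-in falls out of the configuration: every tier-1 agent points at the same small set of collectors. If you'd rather spread load than fail over, Flume's load-balancing sink processor is an alternative to the failover group shown earlier (collector names are placeholders):

    # Every tier-1 agent shares this block; round-robin spreads
    # events across the two collectors instead of preferring one.
    tier1.sinkgroups = g1
    tier1.sinkgroups.g1.sinks = c1 c2
    tier1.sinkgroups.g1.processor.type = load_balance
    tier1.sinkgroups.g1.processor.selector = round_robin
    tier1.sinkgroups.g1.processor.backoff = true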

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow