Question

I have 20-25 agents sending data to a couple of collector agents, and these collector agents then have to write it to HDFS.

Where should I run these collector agents: on the DataNodes of the Hadoop cluster, or outside the cluster? What are the pros and cons of each, and how are people currently running them?


Solution

The tier-2 Flume agents use an HDFS sink (hdfsSink) to write directly to HDFS. What's more, tier 1 can use a failover sink group, in case one of the tier-2 Flume agents goes down.
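
A minimal sketch of the tier-1 side, assuming Flume NG properties-file configuration; the hostnames, port, tail command, and directories (collector1.example.com, 4545, /var/log/app/app.log) are placeholders, not values from the question:

    # Tier-1 agent: reads local data and forwards it to the tier-2 collectors
    agent1.sources = src1
    agent1.channels = ch1
    agent1.sinks = sink1 sink2
    agent1.sinkgroups = sg1

    # Example source: tail a local log file (replace with your real source)
    agent1.sources.src1.type = exec
    agent1.sources.src1.command = tail -F /var/log/app/app.log
    agent1.sources.src1.channels = ch1

    # File channel so buffered events survive an agent restart
    agent1.channels.ch1.type = file
    agent1.channels.ch1.checkpointDir = /var/flume/checkpoint
    agent1.channels.ch1.dataDirs = /var/flume/data

    # One Avro sink per tier-2 collector
    agent1.sinks.sink1.type = avro
    agent1.sinks.sink1.hostname = collector1.example.com
    agent1.sinks.sink1.port = 4545
    agent1.sinks.sink1.channel = ch1

    agent1.sinks.sink2.type = avro
    agent1.sinks.sink2.hostname = collector2.example.com
    agent1.sinks.sink2.port = 4545
    agent1.sinks.sink2.channel = ch1

    # Failover sink group: prefer sink1; fail over to sink2 if collector1 is down
    agent1.sinkgroups.sg1.sinks = sink1 sink2
    agent1.sinkgroups.sg1.processor.type = failover
    agent1.sinkgroups.sg1.processor.priority.sink1 = 10
    agent1.sinkgroups.sg1.processor.priority.sink2 = 5
    agent1.sinkgroups.sg1.processor.maxpenalty = 10000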

Other tips

I assume you're using something like Flume. If that's the case, the Flume agent (at least the first tier) runs wherever the data is being sourced from, e.g., a web server for web logs.

Flume does support other protocols, like JMS, so the location will vary in those scenarios; a sketch of a JMS-backed source follows.
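
For example, a JMS source on a tier-1 agent might look something like this (a sketch only; the broker URL, queue name, and context factory shown here assume ActiveMQ and depend entirely on your JMS provider):

    # Hypothetical JMS source on a tier-1 agent (ActiveMQ used as an example)
    agent1.sources.jmsSrc.type = jms
    agent1.sources.jmsSrc.initialContextFactory = org.apache.activemq.jndi.ActiveMQInitialContextFactory
    agent1.sources.jmsSrc.connectionFactory = ConnectionFactory
    agent1.sources.jmsSrc.providerURL = tcp://broker.example.com:61616
    agent1.sources.jmsSrc.destinationName = BUSINESS_DATA
    agent1.sources.jmsSrc.destinationType = QUEUE
    agent1.sources.jmsSrc.channels = ch1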

For production clusters, you don't want to run "agents" like Flume on DataNodes. It's best to reserve the resources of that hardware for the cluster.

If you have a lot of agents, you'll want to use a tiered architecture to consolidate and funnel the numerous sources into a smaller set of agents that will write to HDFS (see the sketch below). This helps control the cluster's visibility and exposure to external servers.
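
A matching tier-2 collector, as a sketch under the same assumptions (the NameNode address, HDFS path, and roll settings are placeholders you would tune):

    # Tier-2 collector: receives Avro events from tier-1 agents, writes to HDFS
    collector1.sources = avroSrc
    collector1.channels = ch1
    collector1.sinks = hdfsSink

    # Listens on the port the tier-1 Avro sinks point at
    collector1.sources.avroSrc.type = avro
    collector1.sources.avroSrc.bind = 0.0.0.0
    collector1.sources.avroSrc.port = 4545
    collector1.sources.avroSrc.channels = ch1

    collector1.channels.ch1.type = file
    collector1.channels.ch1.checkpointDir = /var/flume/checkpoint
    collector1.channels.ch1.dataDirs = /var/flume/data

    # HDFS sink: partition output by date, roll files on a time interval
    collector1.sinks.hdfsSink.type = hdfs
    collector1.sinks.hdfsSink.hdfs.path = hdfs://namenode.example.com:8020/flume/events/%Y/%m/%d
    collector1.sinks.hdfsSink.hdfs.fileType = DataStream
    collector1.sinks.hdfsSink.hdfs.rollInterval = 300
    collector1.sinks.hdfsSink.hdfs.rollSize = 0
    collector1.sinks.hdfsSink.hdfs.rollCount = 0
    collector1.sinks.hdfsSink.hdfs.useLocalTimeStamp = true
    collector1.sinks.hdfsSink.channel = ch1

Running a pair of such collectors on edge nodes (rather than DataNodes) keeps only those two machines exposed to the external agents, while the cluster hardware stays dedicated to Hadoop.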

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow