Question

Has anyone had a chance to work with both Flume and Scribe? I need to set up a framework for moving data around. We have clickstream data coming in as text files. This data needs to be moved from the app servers to HDFS, and then to S3 after archival.

I need help choosing between Flume and Scribe. Which one is better in terms of manageability and setup, and which is easier to customize?

Solution

View the answer posted here

I'll quote the answer:

  1. Flume allows you to configure your Flume installation from a central point, without having to ssh into every machine, update a configuration variable and restart a daemon or two. You can start, stop, create, delete and reconfigure logical nodes on any machine running Flume from any command line in your network with the Flume jar available.

  2. Flume also has centralised liveness monitoring. We've heard a couple of stories of Scribe processes silently failing, but lying undiscovered for days until the rest of the Scribe installation starts creaking under the increased load. Flume allows you to see the health of all your logical nodes in one place (note that this is different from machine liveness monitoring; often the machine stays up while the process might fail).

  3. Flume supports three distinct types of reliability guarantees, allowing you to make tradeoffs between resource usage and reliability. In particular, Flume supports fully ACKed reliability, with the guarantee that all events will eventually make their way through the event flow.

  4. Flume's also really extensible - it's really easy to write your own source or sink and integrate most any system with Flume. If rolling your own is impractical, it's often very straightforward to have your applications output events in a form that Flume can understand (Flume can run Unix processes, for example, so if you can use shell script to get at your data, you're golden).

This isn't an exhaustive list of benefits to using Flume - I haven't touched on using decorators for lightweight transformation or metadata extraction, the configuration language, the ability to run several logical nodes in a single Flume process, automatic bucketing and rolling of log files in HDFS... there's lots more about Flume that we're looking forward to sharing with everyone.
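To make point 4 of the quoted answer a bit more concrete: since Flume can run Unix processes, one low-effort option for clickstream text files is a small script that prints one event per line to standard output. The sketch below is only an illustration under that assumption. The script name `emit_clicks.py`, the function `emit_events`, and the example path are hypothetical, and how Flume is configured to launch the script depends on the Flume version, so that part is deliberately left out.

```python
import sys
from typing import Iterable, TextIO


def emit_events(lines: Iterable[str], out: TextIO) -> int:
    """Write one event per line to `out` and return how many were written.

    Assumption: each non-blank input line is one clickstream event.
    """
    count = 0
    for raw in lines:
        event = raw.strip()      # drop the newline and surrounding whitespace
        if not event:            # skip blank lines rather than emit empty events
            continue
        out.write(event + "\n")
        count += 1
    out.flush()                  # flush so the consuming process sees events promptly
    return count


if __name__ == "__main__":
    # Hypothetical usage: python emit_clicks.py /var/log/app/clicks.txt
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8", errors="replace") as f:
            emit_events(f, sys.stdout)
```

Keeping the logic in a function that takes any iterable of lines and any text stream makes it easy to test without real log files, and the same function would work for lines read from a pipe instead of a file.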

The key difference to me is that Cloudera is actively supporting Flume. While I do generally trust Facebook to maintain great open source projects, Cloudera's business is built around providing support for tools like this, so I have faith that Flume will be better supported in the long term. I want to minimize the time I have to think about this particular problem. That said, so far I've had a lot of annoying issues where Flume was either a bit convoluted in its abstraction or buggy in its implementation, as you might expect from a pre-1.0 technology. If Asana weren't still in beta, I'd probably have chosen Scribe.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow