Question

I am working with massive data; my input data is about 100 GB. I want to choose one of the Hadoop distributions, but I don't know whether to choose a MapR cluster or a Cloudera cluster. I want to use the free versions (MapR M3 and Cloudera CDH4, which uses Hadoop 0.20). Which of them is better? Which configurations should I use so that they work best? Thanks.


Solution

Honestly, the answer to this question is the most common answer in the world: it depends. It's totally up to you and your requirements. One person might find a particular flavor more suitable for their needs, and you might find the same flavor less useful. It's also partly a matter of personal choice; I personally like Apache's Hadoop. All of them are good. What matters is which one fits your needs.

"Which of them is better?" is a controversial topic. Questions like this often end up as heated arguments. See this question for example. So I'm not going to list the advantages of any one over the other. But there are certain differences among these flavors of Hadoop that could help you during your thought process.

The major difference between CDH (and Apache Hadoop as well) and MapR is that MapR uses its own proprietary file system, MapRFS, instead of HDFS. The M3 Edition is free and available for unlimited production use. Support is provided on a community basis and through MapR's forums. CDH is 100% open source, and you can use the "Standard" version of Cloudera Manager without any charge. And Apache, well, it's Apache :). Do whatever you feel like.

MapR has also recently partnered with Canonical, the organization behind the Ubuntu operating system, in an effort to make Hadoop available as an integrated part of Ubuntu through its repositories. According to the announcement, MapR's M3 Edition for Apache Hadoop will be packaged and made available for download as an integrated part of the Ubuntu operating system (see this if you need more info). The source code is available on GitHub. The CDH codebase is the same as Apache's, with some patches of Cloudera's own.

But the free M3 edition lacks some good features such as JobTracker HA, NameNode HA, mirroring, and snapshots. CDH4, being based on Hadoop 2.x, does provide the HA features, though. By virtue of its design, MapR doesn't have a single point of failure (SPOF) the way CDH3 (or Hadoop 1.x) does. MapRFS stores data in volumes, which are conceptually sets of containers distributed across the cluster. Each container includes its own metadata, which eliminates the central NameNode as a single point of failure. The API is still Apache Hadoop compatible. MapR's setup requirements differ from those of Apache/CDH; for example, MapR requires raw volumes to be available for installation. Once you have the correct hardware and OS prerequisites, setup and evaluation times should be of the same order of magnitude as for Apache/CDH.
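To make the SPOF point concrete, here is a toy sketch in Python. It is not MapR or HDFS code: the classes `CentralMetadata`, `Container`, and `ContainerMetadata` are hypothetical names made up for illustration, the checksum-based routing is an assumption (it is not how MapR locates containers), and replication is ignored entirely. It only contrasts one central metadata node with metadata that is spread across containers.

```python
import zlib


class CentralMetadata:
    """Toy model: one node holds the metadata for every file."""

    def __init__(self):
        self.alive = True
        self.index = {}

    def put(self, path, location):
        self.index[path] = location

    def lookup(self, path):
        if not self.alive:
            raise RuntimeError("metadata node down: no file can be located")
        return self.index[path]


class Container:
    """Toy model: a container keeps metadata only for the files it stores."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.index = {}


class ContainerMetadata:
    """Toy model: metadata is spread over several containers."""

    def __init__(self, names):
        self.containers = [Container(n) for n in names]

    def _pick(self, path):
        # Hypothetical routing by checksum; NOT how MapR locates containers.
        slot = zlib.crc32(path.encode()) % len(self.containers)
        return self.containers[slot]

    def put(self, path, location):
        self._pick(path).index[path] = location

    def lookup(self, path):
        container = self._pick(path)
        if not container.alive:
            raise RuntimeError(f"container {container.name} down")
        return container.index[path]


def reachable(store, paths):
    """Count how many paths can still be looked up."""
    count = 0
    for path in paths:
        try:
            store.lookup(path)
            count += 1
        except RuntimeError:
            pass
    return count
```

In this sketch, setting `alive = False` on the central node makes every lookup fail, while taking down one container only affects the paths routed to that container.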

IMHO, M3 is not going to give you a huge advantage over Apache/CDH, because some of the attractive MapR features, such as NFS HA and snapshots, are not present in the free M3 edition.

Being the first one in the market, Cloudera definitely has an edge in terms of experience and a solid customer base. But MapR has been more innovative, making significant changes to the MapReduce and HDFS components to improve performance.

I'll write some more later, as I'm on a call and you are waiting for the answer ;)

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow