Question

Can somebody outline the differences between the various Hadoop distributions available, using the Apache Hadoop distro as a baseline?

Is there a good reason to use one of these distributions over the standard Apache Hadoop distro?


Solution

Disclaimer: I interned at Cloudera this summer (but some of my best friends are at Yahoo! :-))

The Yahoo distribution is a version of Hadoop 20 that they run (ran?) on some subset of their clusters. It includes a set of patches for stability, bug fixes, etc. It is a source release; it does not have admin-friendly features like rpm or debian packages, etc.

The Cloudera distribution is packaged as rpms and debs (the source is also available). This means you can get updates via standard methods, etc. It also includes stability and bug fix patches. It is constantly maintained (not to say Yahoo's isn't -- I suppose one could just go on github and check when they last updated it). It also packages Pig and Hive.

Cloudera's distribution of Hadoop 20 is in beta, and 18 is considered stable (more on this on the Cloudera blog). The 18 version also includes packages for Hive and Pig; for 20, you have to build them yourself (there aren't official releases of Pig or Hive that support 20 yet, although patches exist). There may well be significant overlap between the Cloudera and Yahoo versions of 20; both provide manifests, so you can check. The latest documentation of Cloudera's distros is at http://archive.cloudera.com
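As a rough illustration of the "check the manifests" suggestion, here is a minimal Python sketch that compares the patch lists of two distributions. It assumes each manifest is plain text that mentions patches by Apache JIRA key (HADOOP-, HDFS- or MAPREDUCE- followed by a number); the real manifests may be laid out differently. The function names and the sample text are made up for the example and are not taken from either distribution.

```python
import re

# Assumption: patches are identified by Apache JIRA keys somewhere in the text.
JIRA_KEY = re.compile(r"\b(?:HADOOP|HDFS|MAPREDUCE)-\d+\b")


def patch_ids(manifest_text):
    """Return the set of JIRA keys mentioned in a manifest's text."""
    return set(JIRA_KEY.findall(manifest_text))


def compare_manifests(a_text, b_text):
    """Split the patches of two manifests into shared and distribution-only sets."""
    a, b = patch_ids(a_text), patch_ids(b_text)
    return {
        "shared": sorted(a & b),
        "only_a": sorted(a - b),
        "only_b": sorted(b - a),
    }


if __name__ == "__main__":
    # Placeholder manifests with invented JIRA numbers, for illustration only.
    manifest_a = "HADOOP-1111 example fix\nHDFS-2222 example fix\n"
    manifest_b = "HDFS-2222 example fix\nMAPREDUCE-3333 example fix\n"
    for group, keys in compare_manifests(manifest_a, manifest_b).items():
        print(group, keys)
```

In practice you would read the two manifest files from disk and pass their contents to compare_manifests; a large "shared" list would suggest the two versions of 20 are closer than the raw patch counts imply.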

Yahoo does not provide support for their distribution; they provide their patched version as a service to the community, so the folks who are interested can build what Yahoo runs internally. Given the size of Yahoo clusters, that's a significant contribution, especially if you aren't a Hadoop developer who follows the JIRAs all the time. Cloudera supports their distribution commercially, as well as providing some community support via the Hadoop mailing lists and, for distro-specific issues, on their GetSatisfaction page.

Both are pretty different from the vanilla Apache distro since they patch it in between releases (the Cloudera version of 20 has 60+ patches!).

OTHER TIPS

Yahoo has discontinued its own distribution and is focusing on Apache Hadoop.

http://developer.yahoo.com/blogs/hadoop/posts/2011/01/announcement-yahoo-focusing-on-apache-hadoop-discontinuing-the-yahoo-distribution-of-hadoop/

http://www.cloudera.com/blog/2011/02/some-news-related-to-the-apache-hadoop-project/

HortonWorks (www.hortonworks.com) was recently spun out of Yahoo. Unlike Yahoo, HortonWorks will also provide support.

http://www.hortonworks.com/about-us/our-manifesto/

Cloudera operates along the same lines as HortonWorks:

http://www.cloudera.com/products-services/

The main difference is that HortonWorks wants to make the Apache distributions stable and easy to install, among other things, while Cloudera has its own distribution, CDH*, based on Apache Hadoop.

There are different reasons for choosing a Hadoop distribution such as Cloudera, Hortonworks or MapR instead of Apache Hadoop. Two big advantages are tool support and commercial support. With plain Apache Hadoop, you also have a lot of trouble "collecting and integrating" all the Hadoop frameworks such as Pig, Hive, etc. in the right, mutually compatible versions.

Take a look at my article at InfoQ. It explains differences between Apache Hadoop, Hadoop distributions and big data suites, and when to use which one:

http://www.infoq.com/articles/BigDataPlatform

Best regards,

Kai Wähner (@KaiWaehner, www.kai-waehner.de/blog)

SquareCog is right on almost all points except: The Yahoo! distribution is what is run on all the production clusters at Yahoo!, not a subset of them. This is more than 25,000 machines in total. The Yahoo! distribution has had the extensive, end-to-end testing necessary to ensure reliable, consistent operation. The other distribution is more liberal about applying patches and so may have more features, but has not been tested as extensively.

Licensed under: CC-BY-SA with attribution