Question

There is plenty of hype surrounding Hadoop and its ecosystem. However, in practice, where many data sets are in the terabyte range, isn't it more reasonable to use Amazon Redshift for querying large data sets rather than spending time and effort building a Hadoop cluster?

Also, how does Amazon Redshift compare with Hadoop with respect to setup complexity, cost, and performance?

Solution

tl;dr: They differ markedly in many respects, and I don't think Redshift will replace Hadoop.

-Function
You can't run anything other than SQL on Redshift. Perhaps most importantly, you can't run custom functions of any kind on Redshift. In Hadoop you can, in many languages (Java, Python, Ruby... you name it). For example, NLP in Hadoop is easy, while it's more or less impossible in Redshift. In other words, there are lots of things you can do in Hadoop but not in Redshift. This is probably the most important difference.
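For instance, a Hadoop Streaming mapper can be any executable. The toy tokenizer below is purely hypothetical, just to illustrate the kind of per-record custom code that plain SQL on Redshift can't express:

```python
#!/usr/bin/env python
# mapper.py -- a hypothetical Hadoop Streaming mapper doing a trivial
# "NLP" step (lowercasing + tokenizing) that doesn't map onto plain SQL.
import re
import sys

for line in sys.stdin:
    # Emit one (token, 1) pair per word; a reducer would sum the counts.
    for token in re.findall(r"[a-z']+", line.lower()):
        print("%s\t%d" % (token, 1))
```

You'd submit it with something like `hadoop jar hadoop-streaming.jar -input ... -output ... -mapper mapper.py`, paired with a reducer that sums the counts per token.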

-Performance Profile
Query execution on Redshift is in most cases significantly more efficient than on Hadoop, but that efficiency comes from the indexing that is done when the data is loaded into Redshift (I'm using the term indexing very loosely here). It's therefore great if you load your data once and execute many queries against it; if you only need to run a single query, you might actually come out behind overall.
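To make that trade-off concrete, here's a back-of-envelope sketch. Every number in it is a made-up placeholder, not a benchmark:

```python
# Break-even point for "load once, query many" (hypothetical numbers).
load_minutes = 60.0     # one-time cost of loading the data into Redshift
redshift_query = 0.5    # minutes per query once the data is loaded
hadoop_query = 5.0      # minutes per query scanning the raw files

# Redshift total: load + n * redshift_query; Hadoop total: n * hadoop_query.
# Redshift wins once n > load / (hadoop_query - redshift_query).
break_even = load_minutes / (hadoop_query - redshift_query)
print("Redshift is faster overall after ~%.0f queries" % break_even)  # ~13
```

With these placeholder numbers, a one-off query is cheaper on Hadoop; the loading cost only pays for itself around the thirteenth query.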

-Cost Profile
As with performance, which solution wins out on cost depends on the situation, but you probably need quite a lot of queries for Redshift to come out cheaper than Hadoop (more specifically, Amazon's Elastic MapReduce). For example, if you are doing OLAP, Redshift is very likely to be cheaper. If you run daily batch ETLs, Hadoop is more likely to win.
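Here's an illustrative sketch of why the workload shape matters. The prices are placeholder assumptions, not AWS's actual rates:

```python
# Hypothetical cost sketch (placeholder prices, not AWS's actual rates).
redshift_node_hour = 0.85   # assumed price per Redshift node-hour
emr_cluster_hour = 4.00     # assumed price per transient EMR cluster-hour

# OLAP: an always-on 2-node Redshift cluster answering ad-hoc queries
# all day long amortizes its cost across thousands of queries.
redshift_month = 2 * redshift_node_hour * 24 * 30   # ~$1224/month
per_query = redshift_month / 30000                  # at ~1000 queries/day

# Daily batch ETL: spin up EMR for a 2-hour job, then shut it down.
emr_month = emr_cluster_hour * 2 * 30               # ~$240 for 30 runs

print("Redshift, 1000 OLAP queries/day: $%.2f per query" % per_query)
print("EMR, one 2h batch/day: $%.0f/month vs $%.0f to keep Redshift up"
      % (emr_month, redshift_month))
```

The point of the sketch: many small interactive queries amortize an always-on warehouse, while a single short daily batch job doesn't.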

Having said that, we've moved part of our ETL from Hive to Redshift, and it was a pretty great experience, mostly because of the ease of development. Redshift's query engine is based on PostgreSQL and is very mature compared to Hive's. Its ACID characteristics make it easier to reason about, and the quicker response times allow more testing to be done. It's a great tool to have, but it won't replace Hadoop.
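One concrete payoff of that PostgreSQL lineage is that stock Postgres client libraries work against Redshift. A minimal sketch, where the hostname, database, table, and credentials are all placeholders:

```python
# Hypothetical example: querying Redshift with a stock PostgreSQL driver.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder
    port=5439,            # Redshift's default port
    dbname="analytics",
    user="etl_user",
    password="...",
)
with conn, conn.cursor() as cur:
    cur.execute("SELECT date_trunc('day', ts) AS day, count(*) "
                "FROM events GROUP BY 1 ORDER BY 1;")
    for day, n in cur.fetchall():
        print(day, n)
conn.close()
```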

EDIT: As for setup complexity, I'd even say it's easier with Hadoop if you use AWS's EMR. Their tooling is so mature that it's ridiculously easy to get a Hadoop job running. The tools and mechanisms surrounding Redshift's operation aren't that mature yet. For example, Redshift can't handle trickle loading, so you have to turn the trickle into a batched load yourself, which can add complexity to your ETL.
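The usual workaround is micro-batching: buffer the trickle, flush it to S3 periodically, and issue one COPY per batch. A rough sketch, where the bucket, table, IAM role, and event_stream() source are all placeholder assumptions:

```python
# Hypothetical micro-batcher: turn a trickle of records into batched COPYs.
import time
import boto3
import psycopg2

s3 = boto3.client("s3")
conn = psycopg2.connect(host="...", port=5439, dbname="analytics",
                        user="etl_user", password="...")

def event_stream():
    """Placeholder for whatever produces the trickle (a queue, Kinesis...)."""
    while True:
        yield "2013-01-01T00:00:00,click,user42"   # fake CSV row

def flush(rows, batch_id):
    """Write buffered CSV rows to S3, then load them with a single COPY."""
    key = "incoming/batch-%d.csv" % batch_id
    s3.put_object(Bucket="my-etl-bucket", Key=key,
                  Body="".join(rows).encode("utf-8"))
    with conn, conn.cursor() as cur:
        cur.execute(
            "COPY events FROM 's3://my-etl-bucket/%s' "
            "IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy' CSV;"
            % key)

buffer, batch_id, last_flush = [], 0, time.time()
for record in event_stream():
    buffer.append(record + "\n")
    # Flush every 10k records or every 5 minutes, whichever comes first.
    if len(buffer) >= 10000 or time.time() - last_flush > 300:
        flush(buffer, batch_id)
        buffer, batch_id, last_flush = [], batch_id + 1, time.time()
```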

Other tips

The current size limit for Amazon Redshift is 128 nodes, or 2 PB of compressed data. That might be circa 6 PB uncompressed, though mileage varies with compression. You can always let us know if you need more. anurag@aws (I run Amazon Redshift and Amazon EMR.)
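That 6 PB figure is just the compressed ceiling multiplied by an assumed compression ratio:

```python
# 2 PB compressed at an assumed ~3x compression ratio (varies with the data).
compressed_pb = 2
assumed_ratio = 3
print("~%d PB uncompressed" % (compressed_pb * assumed_ratio))  # ~6 PB
```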

Personally, I don't think it's all that difficult to set up a Hadoop cluster, but I know that it is sometimes painful when you are getting started.

HDFS's size limits well exceed a terabyte (or did you mean exabyte?). If I'm not mistaken, it scales to yottabytes or some other measurement that I don't even know the word for. Whatever it is, it's really big.

Tools like Redshift have their place, but I always worry about vendor-specific solutions. My main concern is always "what do I do when I am dissatisfied with their service?" I can go to Google and shift my analysis work into their paradigm, or I can go to Hadoop and shift that same work into that system. Either way, I'm going to have to learn something new and do a lot of work translating things.

That being said, it's nice to be able to upload a dataset and get to work quickly - especially if what I'm doing has a short lifecycle. Amazon has done a good job of answering the data security problem.

If you want to avoid Hadoop, there will always be an alternative. But Hadoop is not all that difficult to work with once you get going with it.
