Spark vs MapReduce

https://datascience.stackexchange.com/questions/9083

16-10-2019

Question

I know that Spark improves performance relative to MapReduce by doing in-memory computations. But since caching is decided explicitly by the programmer, one can also proceed without it. I believe that even in such cases, Spark is 10x faster than MapReduce. What features of the framework make this possible?

Solution

No, this is not true in general. For a map-only job, or a single map and reduce, MapReduce is a bit faster. It is more optimized for this pattern and a lot easier to scale and tune. However, few problems consist of just one or two operations. Once you have a chain of them to execute, considering the entire DAG and execution plan is a win, even if memory is not used for persistence.
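As a rough illustration of why planning the whole chain can help, here is a pure-Python analogy. It uses neither Spark nor Hadoop APIs, and every name in it is hypothetical. It contrasts running each step as a separate job that materialises its full output with streaming records through all steps in one pass.

```python
def run_as_separate_jobs(data, steps):
    """Run each step as its own 'job'.

    The full output of a step is materialised before the next step starts,
    loosely like a chain of jobs that each write their output.
    Returns (result, total number of records materialised).
    """
    written = 0
    current = list(data)
    for step in steps:
        current = list(step(current))  # whole intermediate result is built
        written += len(current)
    return current, written


def run_as_one_pipeline(data, steps):
    """Chain all steps lazily so each record flows through every step.

    Only the final result is materialised.
    Returns (result, total number of records materialised).
    """
    stream = iter(data)
    for step in steps:
        stream = step(stream)  # generators: nothing is computed yet
    result = list(stream)
    return result, len(result)


# Two illustrative steps: a map followed by a filter.
steps = [
    lambda xs: (x * 2 for x in xs),
    lambda xs: (x for x in xs if x % 3 == 0),
]

separate_result, separate_written = run_as_separate_jobs(range(10), steps)
pipeline_result, pipeline_written = run_as_one_pipeline(range(10), steps)
```

Both functions return the same result, but the step-by-step version materialises the output of the map as well as the final output. This is only a toy model of the idea; a real execution plan also involves partitioning, shuffles, and scheduling.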

Developer productivity is also related to efficiency. If I don't have to spend my time writing lower-level code, I can spend more time designing a more sophisticated distributed application that could be faster.

You can see this in Crunch, for example, which gives you a high-level API on top of MapReduce similar to Spark's Java API, and I'd say you can write more efficient Crunch jobs in a fixed amount of time than raw M/R jobs.

OTHER TIPS

When you script out steps for Spark on an RDD, it does not begin executing the operations until the data needs to be accessed. Instead of manipulating the data right away, Spark builds a graph of how to get from the input data to the desired result set, and it does not store intermediate data. For example, if you enter the following into a shell:

my_rdd = sc.parallelize(some_data)
mapped_data = my_rdd.map(...)
filtered_data = mapped_data.filter(...)
filtered_data.take(3)

you'll notice that when you enter lines 2 and 3, no execution occurs. This is because Spark isn't creating mapped_data or filtered_data, but rather is building a graph of how to get from my_rdd to filtered_data. Work only happens at the take(3) call on line 4. The only reason you would really write it this way is if you need to access mapped_data at a later stage, in which case you would want to cache it.
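The behaviour described above can be mimicked without a cluster. The following is a minimal sketch in plain Python, not Spark code: `LazyDataset` is a hypothetical toy stand-in for an RDD that records transformations and only runs them when an action (`take`) is called.

```python
import itertools


class LazyDataset:
    """Toy stand-in for an RDD (hypothetical, not the Spark API).

    Transformations are only recorded; they run when an action is called.
    """

    def __init__(self, source, ops=()):
        self._source = source
        self._ops = tuple(ops)

    def map(self, fn):
        # Record the transformation and return a new dataset; no work is done.
        return LazyDataset(self._source, self._ops + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._ops + (("filter", pred),))

    def take(self, n):
        # Action: build the chained iterator and pull only n results from it.
        stream = iter(self._source)
        for kind, fn in self._ops:
            stream = map(fn, stream) if kind == "map" else filter(fn, stream)
        return list(itertools.islice(stream, n))


calls = []  # records every element the map function actually touches


def double(x):
    calls.append(x)
    return x * 2


my_data = LazyDataset(range(1000))
mapped_data = my_data.map(double)
filtered_data = mapped_data.filter(lambda x: x % 3 == 0)
calls_before_action = len(calls)  # nothing has executed yet

first_three = filtered_data.take(3)
```

Here `double` is never called until `take(3)` runs, and even then it is applied only to as many elements as are needed to produce three results. Real Spark works at the level of partitions rather than individual records, so the amount of work it skips will differ from this toy.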

Licensed under: CC-BY-SA with attribution
