Question

Looking at the first paper on RDDs/Apache Spark, I found a statement saying that "RDDs degrade gracefully when there is not enough memory to store them, as long as they are only being used in scan-based operations".

What are scan-based operations in the context of RDDs, and which of the Transformations in Spark are scan-based operations?


Solution

Scan-based operations are essentially the operations that require evaluating a predicate over the elements of an RDD.

In other words, whenever you create an RDD or a DataFrame over which a predicate has to be computed, for example by performing a filter or a map on a case class, or even when inspecting the plan with the explain method, the operation is considered scan-based.
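For instance, here is a minimal DataFrame sketch (assuming a spark-shell style session where spark is already available, with a made-up Person case class); the filter predicate is evaluated against every row, i.e. the data is scanned:

import spark.implicits._

case class Person(name: String, age: Int)

val people = Seq(Person("Ana", 34), Person("Bo", 17)).toDF()

// the predicate age >= 18 is evaluated against every row
val adults = people.filter($"age" >= 18)

// explain prints the physical plan, where the scan and the filter show up
adults.explain()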

To make this clearer, let's review the definition of a predicate.

A predicate or a functional predicate is a logical symbol that may be applied to an object term to produce another object term.

Functional predicates are also sometimes called mappings, but that term can have other meanings as well.

Example:

// scan-based transformation: the predicate here is !_.contains("#")
rdd.filter(!_.contains("#"))

// another scan-based transformation
rdd.filter(myfunc) // myfunc is a boolean (predicate) function

// a third scan-based transformation (map), followed by a non-scan-based operation (reduce)
rdd.map(myfunc2)
   .reduce(myfunc3)
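Putting the snippets above together, here is a self-contained version that can be pasted into spark-shell; the sample data and the concrete functions standing in for myfunc, myfunc2 and myfunc3 are made up for illustration:

val rdd = spark.sparkContext.parallelize(Seq("alpha", "#beta", "gamma", "delta#"))

// scan-based: the predicate !_.contains("#") is evaluated on every element
val noHash = rdd.filter(!_.contains("#"))

// scan-based: the mapping function (here, string length) is evaluated on every element
val lengths = noHash.map(_.length)

// reduce then aggregates the already-scanned values
val total = lengths.reduce(_ + _)

println(total) // 10 ("alpha" and "gamma" have 5 characters each)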

If you want to understand how Spark internals work, I suggest watching the presentations Databricks has given on the topic.

Licensed under: CC-BY-SA with attribution