Question

From the Spark documentation:

Spark can cache datasets in memory to speed up reuse. In the example above, we can load just the error messages in RAM using: errors.cache()

My understanding was that Spark performed all operations in memory by default?

So what happens when the result of an operation is not cached, is it by default persisted to disk?

Or does it just mean that the results of the operation are kept in memory after it executes?


Solution

My understanding was that Spark performed all operations in memory by default?

No. Most operators do not cache their results in memory; you need to call cache() explicitly to keep a result in memory.
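To see why the explicit cache() call matters, here is a toy model in plain Python (this is not Spark's real implementation or API, just a sketch of the behaviour): without caching, every action recomputes the dataset from scratch; with caching, it is computed once and then served from memory.

```python
# Toy stand-in for an RDD (NOT Spark's actual classes), used only to
# illustrate the recompute-vs-cache distinction described above.
class ToyRDD:
    def __init__(self, compute):
        self._compute = compute        # function that produces the data
        self._cached = None
        self._cache_enabled = False

    def cache(self):
        # Mark this dataset to be kept in "memory" after first computation.
        self._cache_enabled = True
        return self

    def collect(self):
        # An "action": triggers the computation.
        if self._cache_enabled:
            if self._cached is None:
                self._cached = self._compute()
            return self._cached
        return self._compute()         # no cache: recompute every time

compute_count = 0

def load_errors():
    global compute_count
    compute_count += 1                 # count how often real work happens
    return ["ERROR a", "ERROR b"]

uncached = ToyRDD(load_errors)
uncached.collect()
uncached.collect()
print(compute_count)                   # 2: recomputed for each action

compute_count = 0
cached = ToyRDD(load_errors).cache()
cached.collect()
cached.collect()
print(compute_count)                   # 1: computed once, then reused
```

In real Spark the same pattern applies: calling errors.cache() before repeated actions avoids re-reading and re-filtering the source data on every action.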

So what happens when the result of an operation is not cached, is it by default persisted to disk?

For most operators, Spark just creates a new RDD that wraps the old RDD. From "Fast Data Processing with Spark":

It is crucial to understand that even though an RDD is defined, it does not actually contain data. This means that when you go to access the data in an RDD it could fail. The computation to create the data in an RDD is only done when the data is referenced; for example, it is created by caching or writing out the RDD. This means that you can chain a large number of operations together, and not have to worry about excessive blocking. It's important to note that during the application development, you can write code, compile it, and even run your job, and unless you materialize the RDD, your code may not have even tried to load the original data.

So the computation does not start until you call a method that fetches the result. The materializing operators (actions) are methods like first, collect, and saveAsTextFile. The result is not kept in memory unless you call cache.
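The lazy-wrapping behaviour described in the quote can be sketched in a few lines of plain Python (again, a conceptual model with made-up names, not Spark's code): a transformation only returns a new wrapper around its parent, and no data is touched until an action forces the chain to run.

```python
# Track when real work happens.
events = []

def load_lines():
    # Stands in for reading the original data; the side effect
    # records that the computation actually ran.
    events.append("load")
    return ["INFO ok", "ERROR boom"]

def lazy_filter(parent, predicate):
    # A "transformation": just wraps the parent in a new thunk.
    # Nothing is computed here.
    return lambda: [x for x in parent() if predicate(x)]

# Chain transformations: still no data has been loaded.
errors = lazy_filter(load_lines, lambda line: line.startswith("ERROR"))
assert events == []                # defining the pipeline did no work

# The "action" triggers the whole chain, including loading the data.
result = errors()
assert events == ["load"]
print(result)                      # ['ERROR boom']
```

This is why, as the book notes, your job can compile and even run without ever touching the input data: until an action materializes the RDD, the pipeline is only a description of work to do.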

Incidentally, "Fast Data Processing with Spark" is a great book for learning Spark.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow