Question

We use .cache() on an RDD to keep a dataset persisted in memory. My concern is: when does this cache expire?

dt = sc.parallelize([2, 3, 4, 5, 6])
dt.cache()

Solution

It will not expire on its own. The cache is kept until Spark runs low on memory, at which point it evicts the least-recently-used cached partitions. When you later ask for something that has been evicted, Spark recomputes it from the lineage (the pipeline of transformations) and caches it again. If that recomputation would be too expensive, unpersist other RDDs, don't cache them in the first place, or persist them to disk instead.
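A minimal PySpark sketch of that trade-off, assuming a pyspark shell where sc is already available (as in the question); the map step is only an illustrative stand-in for an expensive pipeline:

from pyspark import StorageLevel

# Illustrative "expensive" pipeline; any chain of transformations behaves the same way.
dt = sc.parallelize([2, 3, 4, 5, 6]).map(lambda x: x * x)

# Memory-only caching: partitions may be evicted under memory pressure and
# will then be recomputed from the lineage on the next action.
dt.cache()          # same as dt.persist(StorageLevel.MEMORY_ONLY)
dt.count()          # the first action actually materializes the cache

# If recomputation is too expensive, spill evicted partitions to disk instead.
dt.unpersist()
dt.persist(StorageLevel.MEMORY_AND_DISK)
dt.count()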

OTHER TIPS

In addition to Jan's answer, I would like to point out that serialized RDD storage (caching) uses far less memory than the default deserialized caching for large datasets.

It also helps reduce garbage-collection overhead for large datasets.

Additionally, from the Spark docs:

When your objects are still too large to efficiently store despite this tuning, a much simpler way to reduce memory usage is to store them in serialized form, using the serialized StorageLevels in the RDD persistence API, such as MEMORY_ONLY_SER. Spark will then store each RDD partition as one large byte array. The only downside of storing data in serialized form is slower access times, due to having to deserialize each object on the fly. We highly recommend using Kryo if you want to cache data in serialized form, as it leads to much smaller sizes than Java serialization (and certainly than raw Java objects).
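As a rough sketch of applying that advice from a standalone PySpark script (the config key spark.serializer is a standard Spark setting; the app name and data are made up). Note that the quoted paragraph targets the JVM API: PySpark records are pickled anyway, so the Kryo setting below mainly affects JVM-side data such as shuffle buffers, while Scala/Java code would pass StorageLevel.MEMORY_ONLY_SER to persist():

from pyspark import SparkConf, SparkContext, StorageLevel

# Kryo must be configured before the SparkContext is created.
conf = (SparkConf()
        .setAppName("serialized-cache-demo")  # arbitrary name
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"))
sc = SparkContext(conf=conf)

rdd = sc.parallelize(range(1000000))

# In Scala/Java this would be StorageLevel.MEMORY_ONLY_SER; PySpark already
# stores pickled bytes, so MEMORY_ONLY is its serialized counterpart here.
rdd.persist(StorageLevel.MEMORY_ONLY)
rdd.count()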

Spark will also automatically unpersist (clean up) a cached RDD or DataFrame once it is no longer referenced anywhere. To check whether an RDD is cached, open the Spark UI, go to the Storage tab, and look at the memory details.

From the shell (or your application code) you can call rdd.unpersist() or sqlContext.uncacheTable("sparktable") to remove an RDD or table from memory. Also remember that Spark is lazily evaluated: unless and until you trigger an action, no data is loaded or processed into the RDD or DataFrame, so the cache is only populated once an action runs.
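A short sketch tying these points together, again assuming a pyspark shell where sc exists; is_cached and getStorageLevel() are standard RDD members:

rdd = sc.parallelize([2, 3, 4, 5, 6]).map(lambda x: x + 1)

rdd.cache()                    # lazy: nothing is computed or stored yet
print(rdd.is_cached)           # True - only the intent to cache is recorded
print(rdd.getStorageLevel())   # shows the storage level that will be used

rdd.count()                    # the action materializes the data and the cache
                               # (it now appears in the Spark UI's Storage tab)

rdd.unpersist()                # frees the memory and removes it from the Storage tab
print(rdd.is_cached)           # False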

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange