Question

I need to stream-read very large files (TBs in size). To achieve higher throughput, I would like to cache parts of the files in memory. Spark can cache data in distributed memory. How can I use Spark to cache file parts?

The files are larger than the local storage of any single machine, and larger than the total memory capacity of the cluster.


Solution

  1. Store the data in a distributed storage system such as HDFS. This spreads your data across the cluster. Choose the file system that fits your requirements (on-premise, cloud, etc.).

  2. Run Spark on the data in HDFS. Create an RDD from the file (see the Spark documentation), filter out the part of the data you actually need (for example, only the lines containing "error" in a large log file), and cache that part in memory so that subsequent queries are faster. A sketch of this pattern follows the list.
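A minimal sketch of that filter-and-cache pattern, assuming a hypothetical HDFS path (`hdfs://namenode:8020/logs/app.log`) and that lines of interest contain the word "error"; adjust the path and predicate to your data:

```scala
import org.apache.spark.sql.SparkSession

object CacheErrorLines {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cache-error-lines")
      .getOrCreate()
    val sc = spark.sparkContext

    // Read from distributed storage; Spark splits the file into partitions,
    // so no single machine ever has to hold the whole file.
    val lines = sc.textFile("hdfs://namenode:8020/logs/app.log") // hypothetical path

    // Keep only the part you actually need -- here, lines containing "error".
    val errors = lines.filter(_.contains("error"))

    // Cache the filtered (much smaller) RDD so repeated queries reuse it
    // instead of re-reading TBs from HDFS.
    errors.cache()

    // Subsequent actions hit the in-memory copy.
    println(s"error lines: ${errors.count()}")
    println(s"timeout errors: ${errors.filter(_.contains("timeout")).count()}")

    spark.stop()
  }
}
```

Note that only the filtered RDD needs to fit in memory; with the default MEMORY_ONLY storage level, partitions that do not fit are recomputed on demand rather than cached.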

There are a number of caching-related parameters you can tune to help fit your data in memory (keeping the data serialized with Kryo serialization, etc.). See the Spark Memory Tuning guide for details.
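As a sketch of the kind of tuning the Memory Tuning guide describes, the snippet below enables Kryo serialization and persists the RDD in serialized form (MEMORY_ONLY_SER), trading some CPU on access for a much smaller in-memory footprint; the path is again a placeholder:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object KryoCacheTuning {
  def main(args: Array[String]): Unit = {
    // Kryo is more compact and faster than the default Java serialization.
    val conf = new SparkConf()
      .setAppName("kryo-cache-tuning")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)

    val errors = sc.textFile("hdfs://namenode:8020/logs/app.log") // hypothetical path
      .filter(_.contains("error"))

    // Store cached partitions as serialized byte arrays (smaller, but each
    // access pays a deserialization cost). MEMORY_AND_DISK_SER would also
    // spill partitions that still do not fit in memory to local disk.
    errors.persist(StorageLevel.MEMORY_ONLY_SER)

    println(s"error lines: ${errors.count()}")
    sc.stop()
  }
}
```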

You can also consider breaking the data into parts (separate files, partitioned tables, etc.) and loading only a part of it into Spark.
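One hypothetical way to load only a slice: if the data is laid out as many dated files, a Hadoop path glob lets Spark read just the part you need (the directory layout here is an assumption, and `sc` is the SparkContext from the sketch above):

```scala
// Assuming logs are split by date, e.g. hdfs://namenode:8020/logs/2024-05-12/part-*,
// a glob restricts the read to one day's data instead of the full TBs.
val oneDay = sc.textFile("hdfs://namenode:8020/logs/2024-05-12/*")
oneDay.filter(_.contains("error")).cache()
```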

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow