Question

I need to stream-read very large files (TBs in size). To achieve higher throughput, I would like to cache parts of the files in memory. Spark can cache data in distributed memory. How can I use Spark to cache file parts?

The files are larger than the local storage of any single machine, and larger than the total memory capacity of the cluster.


Solution

  1. Store the data in a distributed storage system such as HDFS. This spreads your data across the cluster. Choose the file system that fits your requirements (on-premise, cloud, etc.).

  2. Run Spark on the data in HDFS. Create an RDD from the file (see the Spark documentation), filter out the part of the data you actually need (for example, only the lines containing "error" in a large log file), and cache that part in memory so that subsequent queries are faster. A sketch of this pattern follows the list.
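A minimal sketch of that filter-and-cache pattern, assuming a hypothetical HDFS path (`hdfs://namenode:8020/logs/app.log`) and that lines of interest contain the word "error"; adjust the path and predicate to your data:

```scala
import org.apache.spark.sql.SparkSession

object CacheErrorLines {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cache-error-lines")
      .getOrCreate()
    val sc = spark.sparkContext

    // Read from distributed storage; Spark splits the file into partitions,
    // so no single machine ever has to hold the whole file.
    val lines = sc.textFile("hdfs://namenode:8020/logs/app.log") // hypothetical path

    // Keep only the part you actually need -- here, lines containing "error".
    val errors = lines.filter(_.contains("error"))

    // Cache the filtered (much smaller) RDD so repeated queries reuse it
    // instead of re-reading TBs from HDFS.
    errors.cache()

    // Subsequent actions hit the in-memory copy.
    println(s"error lines: ${errors.count()}")
    println(s"timeout errors: ${errors.filter(_.contains("timeout")).count()}")

    spark.stop()
  }
}
```

Note that only the filtered RDD needs to fit in memory; with the default MEMORY_ONLY storage level, partitions that do not fit are recomputed on demand rather than cached.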

There are a number of caching-related parameters you can tune to help fit your data in memory (keeping the data serialized with Kryo serialization, etc.). See the Spark Memory Tuning guide for details.
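As a sketch of the kind of tuning the Memory Tuning guide describes, the snippet below enables Kryo serialization and persists the RDD in serialized form (MEMORY_ONLY_SER), trading some CPU on access for a much smaller in-memory footprint; the path is again a placeholder:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object KryoCacheTuning {
  def main(args: Array[String]): Unit = {
    // Kryo is more compact and faster than the default Java serialization.
    val conf = new SparkConf()
      .setAppName("kryo-cache-tuning")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)

    val errors = sc.textFile("hdfs://namenode:8020/logs/app.log") // hypothetical path
      .filter(_.contains("error"))

    // Store cached partitions as serialized byte arrays (smaller, but each
    // access pays a deserialization cost). MEMORY_AND_DISK_SER would also
    // spill partitions that still do not fit in memory to local disk.
    errors.persist(StorageLevel.MEMORY_ONLY_SER)

    println(s"error lines: ${errors.count()}")
    sc.stop()
  }
}
```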

You can also consider breaking the data into parts (separate files, partitioned tables, etc.) and loading only a part of it into Spark.
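One hypothetical way to load only a slice: if the data is laid out as many dated files, a Hadoop path glob lets Spark read just the part you need (the directory layout here is an assumption, and `sc` is the SparkContext from the sketch above):

```scala
// Assuming logs are split by date, e.g. hdfs://namenode:8020/logs/2024-05-12/part-*,
// a glob restricts the read to one day's data instead of the full TBs.
val oneDay = sc.textFile("hdfs://namenode:8020/logs/2024-05-12/*")
oneDay.filter(_.contains("error")).cache()
```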

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow