Question

To increase the maximum available memory I use:

export SPARK_MEM=1g

Alternatively, I can use:

val conf = new SparkConf()
             .setMaster("local")
             .setAppName("My application")
             .set("spark.executor.memory", "1g")
val sc = new SparkContext(conf)

The process I'm running requires much more than 1 GB. I would like to use 20 GB, but I only have 8 GB of RAM available. Can RAM be supplemented with disk space as part of a Spark job, and if so, how is this achieved?

Is there a Spark doc that describes how to distribute jobs across multiple Spark installations?

For the Spark configuration I'm using all defaults (as specified at http://spark.apache.org/docs/0.9.0/configuration.html) except for what I have set above. I have a single-machine instance with the following:

CPU: 4 cores
RAM: 8 GB
HD: 40 GB

Update:

I think this is the doc I'm looking for: http://spark.apache.org/docs/0.9.1/spark-standalone.html

Solution 2

If you are trying to solve a problem on a single computer, I do not think it is practical to use Spark. The point of Spark is that it provides a way to distribute computation across multiple machines, especially in cases where the data does not fit on a single machine.

That said, just set spark.executor.memory to 20g to get 20 GB of virtual memory. Once the physical memory is exhausted, swap will be used instead. If you have enough swap configured, you will be able to make use of 20 GB. But your process will most likely slow down to a crawl when it starts swapping.
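For concreteness, here is a minimal sketch of that configuration change, reusing the setup from the question; whether the extra 12 GB is actually usable depends on how much swap the OS has configured.

import org.apache.spark.{SparkConf, SparkContext}

// Sketch: request 20g even though the machine only has 8 GB of RAM.
// The overflow is backed by OS swap, so expect severe slowdown once
// physical memory is exhausted.
val conf = new SparkConf()
             .setMaster("local")
             .setAppName("My application")
             .set("spark.executor.memory", "20g")
val sc = new SparkContext(conf)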

OTHER TIPS

If your job does not fit into memory, Spark will automatically spill to disk - you do NOT need to set up swap; i.e. Daniel's answer is a bit inaccurate. You can configure which kinds of processing will and will not spill to disk using the configuration settings: http://spark.apache.org/docs/0.9.1/configuration.html
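As a rough illustration, these are the spill-related settings from that configuration page; the values shown are placeholders, not recommendations for this particular workload.

import org.apache.spark.{SparkConf, SparkContext}

// Sketch of the spill-related settings (illustrative values only):
val conf = new SparkConf()
             .setMaster("local")
             .setAppName("My application")
             .set("spark.shuffle.spill", "true")           // let shuffle data spill to disk
             .set("spark.shuffle.memoryFraction", "0.3")   // memory used for shuffles before spilling
             .set("spark.storage.memoryFraction", "0.6")   // memory reserved for cached RDDs
val sc = new SparkContext(conf)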

Also, it IS a good idea to use Spark on a single machine, because if you later need your application to scale you get that scaling for free - the same code you write to run on one node will run on N nodes. Of course, if your data is never expected to grow, then yes, stick with pure Scala.

Use spark.shuffle.spill to control whether shuffles spill to disk, and read the "RDD Persistence" documentation to control how cached RDDs spill: http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence
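For example, a minimal sketch combining both knobs (the input path is just a placeholder):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf()
             .setMaster("local")
             .setAppName("My application")
             .set("spark.shuffle.spill", "true")   // shuffles may spill to disk
val sc = new SparkContext(conf)

// Persist with MEMORY_AND_DISK so partitions that don't fit in memory
// are written to disk instead of being recomputed.
val data = sc.textFile("data.txt")                       // placeholder input
val cached = data.persist(StorageLevel.MEMORY_AND_DISK)

cached.count()   // first action materializes the cache
cached.count()   // later actions reuse the in-memory + on-disk partitions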

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow