Question

To increase the maximum available memory I use:

export SPARK_MEM=1g

Alternatively, I can use:

val conf = new SparkConf()
             .setMaster("local")
             .setAppName("My application")
             .set("spark.executor.memory", "1g")
val sc = new SparkContext(conf)

The process I'm running requires much more than 1 GB. I would like to use 20 GB, but I only have 8 GB of RAM available. Can RAM be supplemented with disk space as part of a Spark job, and if so, how is this achieved?

Is there a Spark doc that describes how to distribute jobs across multiple Spark installations?

For the Spark configuration I'm using all defaults (as specified at http://spark.apache.org/docs/0.9.0/configuration.html) except for what I have set above. I have a single-machine instance with the following:

CPU: 4 cores
RAM: 8 GB
HD: 40 GB

Update:

I think this is the doc I'm looking for: http://spark.apache.org/docs/0.9.1/spark-standalone.html

Solution 2

If you are trying to solve a problem on a single computer, I do not think it is practical to use Spark. The point of Spark is that it provides a way to distribute computation across multiple machines, especially in cases where the data does not fit on a single machine.

That said, just set spark.executor.memory to 20g to get 20 GB of virtual memory. Once the physical memory is exhausted, swap will be used instead. If you have enough swap configured, you will be able to make use of 20 GB. But your process will most likely slow down to a crawl when it starts swapping.
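For concreteness, here is a minimal sketch of that configuration change, reusing the setup from the question; whether the extra 12 GB is actually usable depends on how much swap the OS has configured.

import org.apache.spark.{SparkConf, SparkContext}

// Sketch: request 20g even though the machine only has 8 GB of RAM.
// The overflow is backed by OS swap, so expect severe slowdown once
// physical memory is exhausted.
val conf = new SparkConf()
             .setMaster("local")
             .setAppName("My application")
             .set("spark.executor.memory", "20g")
val sc = new SparkContext(conf)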

OTHER TIPS

If your job does not fit into memory, Spark will automatically spill to disk - you do NOT need to set up swap; i.e. Daniel's answer is a bit inaccurate. You can configure which kinds of processing will and will not spill to disk using the configuration settings: http://spark.apache.org/docs/0.9.1/configuration.html
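As a rough illustration, these are the spill-related settings from that configuration page; the values shown are placeholders, not recommendations for this particular workload.

import org.apache.spark.{SparkConf, SparkContext}

// Sketch of the spill-related settings (illustrative values only):
val conf = new SparkConf()
             .setMaster("local")
             .setAppName("My application")
             .set("spark.shuffle.spill", "true")           // let shuffle data spill to disk
             .set("spark.shuffle.memoryFraction", "0.3")   // memory used for shuffles before spilling
             .set("spark.storage.memoryFraction", "0.6")   // memory reserved for cached RDDs
val sc = new SparkContext(conf)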

Also, it IS a good idea to use Spark on a single machine, because if you later need your application to scale you get that scaling for free - the same code you write to run on one node will run on N nodes. Of course, if your data is never expected to grow, then yes, stick with pure Scala.

Use spark.shuffle.spill to control whether shuffles spill to disk, and read the "RDD Persistence" documentation to control how cached RDDs spill: http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence
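For example, a minimal sketch combining both knobs (the input path is just a placeholder):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf()
             .setMaster("local")
             .setAppName("My application")
             .set("spark.shuffle.spill", "true")   // shuffles may spill to disk
val sc = new SparkContext(conf)

// Persist with MEMORY_AND_DISK so partitions that don't fit in memory
// are written to disk instead of being recomputed.
val data = sc.textFile("data.txt")                       // placeholder input
val cached = data.persist(StorageLevel.MEMORY_AND_DISK)

cached.count()   // first action materializes the cache
cached.count()   // later actions reuse the in-memory + on-disk partitions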

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow