Correct way to set Spark variables in a Jupyter notebook
Question
I need to set a couple of variables in my Jupyter notebook, where a sparkContext and sqlContext already exist, and I am doing it wrong. If I don't call sc.stop() first, I get an error that I am trying to instantiate a second context. If I do call it, I get an error that I am trying to call methods on a stopped context.
Can someone tell me the correct way to set these variables?
Here is my code:
from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext
from pyspark.sql import SQLContext

conf = (SparkConf()
        .set("spark.yarn.executor.memoryOverhead", "4096")
        .set("spark.kryoserializer.buffer.max.mb", "1024"))

sc.stop()
sc = SparkContext(conf=conf)
sqlContext = SQLContext.getOrCreate(sc)
Solution
When you run Spark in the shell (or in a Jupyter notebook backed by a Spark kernel), the SparkContext and its SparkConf are already created for you. As the documentation states, once a SparkConf object is passed to Spark it can no longer be modified by the user, so stopping the existing context and creating a new one with your own conf is in fact the right way to do it. The "stopped context" error most likely comes from the SQLContext: SQLContext.getOrCreate may hand back a cached instance that is still bound to the old, stopped context, so rebuild it directly with SQLContext(sc) after creating the new context.
However, in Spark 2.0 and later this juggling is no longer necessary: SparkSession.builder accepts configuration directly, and its getOrCreate() reuses or creates the underlying context for you.