Correct way to set Spark variables in a Jupyter notebook
Question
I need to set a couple of variables in my Jupyter notebook, where a sparkContext and sqlContext already exist, and I am doing it wrong. If I don't call sc.stop() first, I get an error that I am trying to instantiate a second context. If I do call it, I get an error that I am trying to call methods on a stopped context.
Can someone tell me the correct way to set these variables?
Here is my code:
from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext
from pyspark.sql import SQLContext

conf = (SparkConf()
        .set("spark.yarn.executor.memoryOverhead", "4096")
        .set("spark.kryoserializer.buffer.max.mb", "1024"))

sc.stop()
sc = SparkContext(conf=conf)
sqlContext = SQLContext.getOrCreate(sc)
Solution
When you run Spark in the shell (or in a Jupyter notebook backed by a Spark kernel), the SparkContext and its SparkConf are already created for you. As the documentation states, once a SparkConf object is passed to Spark it can no longer be modified by the user, so stopping the existing context and creating a new one with your own conf is in fact the right way to do it. The "stopped context" error most likely comes from the SQLContext: SQLContext.getOrCreate may hand back a cached instance that is still bound to the old, stopped context, so rebuild it directly with SQLContext(sc) after creating the new context.
However, in Spark 2.0 and later this juggling is no longer necessary: SparkSession.builder accepts configuration directly, and its getOrCreate() reuses or creates the underlying context for you.