Using spark-csv inside Jupyter with Python
Question
My ultimate goal is to use Jupyter together with Python for data analysis using Spark. The current hurdle is loading the external spark-csv library. I am using Mac OS and Anaconda as the Python distribution.
In particular, the following:
from pyspark import SparkContext
from pyspark.sql import SQLContext  # this import was missing from the snippet

sc = SparkContext('local', 'pyspark')
sqlContext = SQLContext(sc)
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true').load('file.csv')
df.show()
when invoked from Jupyter yields:
Py4JJavaError: An error occurred while calling o22.load.
: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.csv. Please find packages at http://spark-packages.org
Here are more details:
Setting up Spark together with Jupyter
I managed to set up Spark/PySpark in Jupyter/IPython (using Python 3.x).
Initial system setup
On OS X I installed Python using Anaconda. The default Python version currently installed is 3.4.4 (Anaconda 2.4.0). Note that I have also installed a 2.x version of Python using conda create -n python2 python=2.7.
Installing Spark
This is actually the simplest step; download the latest binaries into ~/Applications or some other directory of your choice. Next, untar the archive: tar -xzf spark-X.Y.Z-bin-hadoopX.Y.tgz.
For easy access to Spark, create a symbolic link to the unpacked directory:
ln -s ~/Applications/spark-X.Y.Z-bin-hadoopX.Y ~/Applications/spark
Lastly, set SPARK_HOME and add Spark's bin directory to the PATH:
export SPARK_HOME=~/Applications/spark
export PATH=$SPARK_HOME/bin:$PATH
You can now run Spark/PySpark locally: simply invoke spark-shell or pyspark.
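As a quick sanity check, run pyspark and execute a trivial job (a minimal sketch; the interactive pyspark shell creates the sc context for you):

# Inside the interactive pyspark shell, a SparkContext named sc
# already exists; these two lines confirm the installation works.
print(sc.version)                        # prints the Spark version string
print(sc.parallelize(range(10)).sum())   # 45 -- a trivial local job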
Setting up Jupyter
In order to use Spark from within a Jupyter notebook, prepend the following to PYTHONPATH (adjust the py4j version to match the zip shipped under $SPARK_HOME/python/lib):
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$SPARK_HOME/python/:$PYTHONPATH
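To verify the wiring, open a fresh Jupyter notebook and check where pyspark is imported from (a minimal check; the exact path depends on where you unpacked Spark):

# If PYTHONPATH is set correctly, pyspark resolves from the Spark
# distribution rather than from site-packages.
import pyspark
print(pyspark.__file__)  # expect a path under $SPARK_HOME/python/pyspark/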
Solution
Assuming the rest of your configuration is correct, all you have to do is make the spark-csv jar available to your program. There are a few ways you can achieve this:
- Manually download the required jars, including spark-csv and a csv parser (for example org.apache.commons.commons-csv), and put them somewhere on the CLASSPATH.
- Use the --packages option (pick the Scala version that was used to build Spark; pre-built versions use 2.10, and the _2.10 / _2.11 suffix in the coordinate must match it), for example via the PYSPARK_SUBMIT_ARGS environment variable (a notebook-only variant is sketched after this list):
export PACKAGES="com.databricks:spark-csv_2.11:1.3.0"
export PYSPARK_SUBMIT_ARGS="--packages ${PACKAGES} pyspark-shell"
- Add the Gradle-style coordinate string to spark.jars.packages in conf/spark-defaults.conf:
spark.jars.packages com.databricks:spark-csv_2.11:1.3.0
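If you prefer to keep everything inside the notebook, the same --packages flag can be injected programmatically. This is a minimal sketch, assuming the com.databricks:spark-csv_2.11:1.3.0 coordinate from above; note that PYSPARK_SUBMIT_ARGS must be set before the SparkContext is created, because it is consumed when the JVM gateway launches:

import os

# Must run before the SparkContext is created.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--packages com.databricks:spark-csv_2.11:1.3.0 pyspark-shell'
)

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext('local[*]', 'csv-demo')
sqlContext = SQLContext(sc)

# spark-csv is now on the classpath, so the data source resolves.
df = (sqlContext.read
      .format('com.databricks.spark.csv')
      .options(header='true', inferSchema='true')
      .load('file.csv'))
df.show()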
Other tips
Use the following procedure on your Mac:
- Open ~/.bash_profile in vi (~/.zshrc if you're on that train)
- Paste the following entry (be sure to specify your desired version of spark-csv):
export PYSPARK_SUBMIT_ARGS="--packages com.databricks:spark-csv_2.11:1.3.0 $PYSPARK_SUBMIT_ARGS"
From there, run ipython notebook and test with something like this:
import pyspark as ps
from pyspark.sql import SQLContext

sc = ps.SparkContext()
sqlContext = SQLContext(sc)  # this line was missing from the original snippet
input_csv = 'file:////PATH_TO_CSV_ON_LOCAL_FILESYSTEM'
df = sqlContext.read.load(input_csv, format='com.databricks.spark.csv', header='true', inferSchema='true')
df.dtypes  # returns the csv's schema breakdown with types
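Once the DataFrame loads, ordinary analysis follows. A minimal sketch using the Spark 1.x SQL API (the temp table name data is an arbitrary choice):

df.printSchema()              # inferred column types, thanks to inferSchema
df.registerTempTable('data')  # Spark 1.x API; exposes the DataFrame to SQL
sqlContext.sql('SELECT COUNT(*) FROM data').show()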