In recent Spark releases (1.0+), we've implemented all of the missing PySpark features listed below. A few new features are still missing, such as Python bindings for GraphX, but the other APIs have achieved near parity (including an experimental Python API for Spark Streaming).
My earlier answers are reproduced below:
Original answer as of Spark 0.9:
A lot has changed in the seven months since my original answer (reproduced at the bottom of this answer):
- Spark 0.7.3 fixed the "forking JVMs with large heaps" issue.
- Spark 0.8.1 added support for persist(), sample(), and sort() (see the sketch after this list).
- The upcoming Spark 0.9 release adds partial support for custom Python -> Java serializers.
- Spark 0.9 also adds Python bindings for MLlib (docs).
- I've implemented tools to help keep the Java API up-to-date.
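To make the 0.8.1 additions concrete, here's a minimal sketch of those calls in PySpark (the app name and data are made up for illustration; sorting is exposed as sortByKey() on pair RDDs):

```python
from pyspark import SparkContext, StorageLevel

sc = SparkContext("local", "ParityExample")  # hypothetical app name
rdd = sc.parallelize(range(1000))

# persist() at an explicit storage level (previously only cache() existed):
rdd.persist(StorageLevel.MEMORY_AND_DISK)

# sample() without replacement, keeping roughly 10% of the elements:
subset = rdd.sample(False, 0.1).collect()

# Sorting goes through sortByKey() on pair RDDs:
smallest = rdd.map(lambda x: (x % 10, x)).sortByKey().take(5)
```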
As of Spark 0.9, the main missing features in PySpark are:
- zip() / zipPartitions() (a possible workaround is sketched after this list).
- Support for reading and writing non-text input formats, like Hadoop SequenceFile (there's an open pull request for this).
- Support for running on YARN clusters.
- Cygwin support (PySpark works fine under Windows PowerShell or cmd.exe, though).
- Support for job cancellation.
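Until zip() is available, one possible workaround is to key both RDDs by a (partition, offset) index and join. This is a hypothetical sketch (indexed and zip_rdds are made-up helper names), and it assumes both RDDs have equal lengths and identical partitioning:

```python
def indexed(rdd):
    # Number each element within its partition so (split, offset)
    # uniquely identifies a position in the RDD.
    def number(split, iterator):
        for offset, item in enumerate(iterator):
            yield ((split, offset), item)
    return rdd.mapPartitionsWithSplit(number)

def zip_rdds(left, right):
    # Join on the positional key, then sort to restore the original order.
    # A sketch, not a drop-in replacement for Scala's zip().
    return indexed(left).join(indexed(right)).sortByKey().values()
```

The sortByKey() pass restores the original order after the join at the cost of an extra shuffle; Scala's zip() avoids this by walking corresponding partitions in lockstep.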
Although we've made many performance improvements, a performance gap remains between Spark's Scala and Python APIs. The Spark users mailing list has an open thread discussing PySpark's current performance.
If you discover any missing features in PySpark, please open a new ticket on our JIRA issue tracker.
Original answer as of Spark 0.7.2:
The Spark Python Programming Guide has a list of missing PySpark features. As of Spark 0.7.2, PySpark is missing support for sample(), sort(), and persistence at non-default StorageLevels. It's also missing a few convenience methods that were added to the Scala API.
The Java API was in sync with the Scala API when it was released, but a number of new RDD methods have been added since then and not all of them have been added to the Java wrapper classes. There's a discussion about how to keep the Java API up-to-date at https://groups.google.com/d/msg/spark-developers/TMGvtxYN9Mo/UeFpD17VeAIJ. In that thread, I suggested a technique for automatically finding missing features, so it's just a matter of someone taking the time to add them and submit a pull request.
Regarding performance, PySpark is going to be slower than Scala Spark. Part of the difference stems from a strange JVM issue when forking processes with large heaps, but there's an open pull request that should fix that. The other bottleneck is serialization: PySpark doesn't currently require users to explicitly register serializers for their objects (internally, we use binary cPickle plus some batching optimizations). In the past, I've looked into adding support for user-customizable serializers that would let you declare the types of your objects and thereby use faster specialized serializers; I hope to resume work on this at some point.
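To make the batching idea concrete, here's an illustrative sketch (not PySpark's actual implementation) of how pickling records in chunks amortizes per-record overhead:

```python
# Illustrative only: serialize records in batches rather than one at a
# time, so the per-record pickling overhead is amortized across a chunk.
import pickle
from itertools import islice

def serialize_batches(records, batch_size=1024):
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield pickle.dumps(batch, protocol=pickle.HIGHEST_PROTOCOL)

def deserialize_batches(blobs):
    for blob in blobs:
        for record in pickle.loads(blob):
            yield record
```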
PySpark is implemented using a regular cPython interpreter, so libraries like numpy should work fine (this wouldn't be the case if PySpark were written in Jython).
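For example (a minimal sketch with made-up data), numpy can be used directly inside transformations:

```python
# Because PySpark workers run CPython, C-extension libraries like numpy
# work inside transformations. The vectors below are made up.
import numpy as np
from pyspark import SparkContext

sc = SparkContext("local", "NumpyExample")  # hypothetical app name
vectors = sc.parallelize([np.array([3.0, 4.0]), np.array([6.0, 8.0])])
norms = vectors.map(lambda v: float(np.linalg.norm(v))).collect()
print(norms)  # [5.0, 10.0]
```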
It's pretty easy to get started with PySpark: simply downloading a pre-built Spark package and running the `pyspark` interpreter should be enough to test it out on your personal computer, and it will let you evaluate its interactive features. If you'd like to use IPython, run `IPYTHON=1 ./pyspark` in your shell to launch PySpark with an IPython shell.
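For example, a first interactive session might look like this (the shell creates the `sc` SparkContext for you; the sample data is made up):

```python
>>> # The pyspark shell predefines a SparkContext named `sc`.
>>> rdd = sc.parallelize(range(10))
>>> rdd.filter(lambda x: x % 2 == 0).collect()
[0, 2, 4, 6, 8]
```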