Question

The Spark Programming Guide mentions slices as a feature of RDDs (both parallel collections and Hadoop datasets): "Spark will run one task for each slice of the cluster." But in the section on RDD persistence, the concept of partitions is used without introduction. Also, the RDD docs mention only partitions, with no mention of slices, while the SparkContext docs mention slices for creating RDDs but partitions for running jobs on RDDs. Are these two concepts the same? If not, how do they differ?

Tuning - Level of Parallelism indicates that "Spark automatically sets the number of “map” tasks to run on each file according to its size ... and for distributed “reduce” operations, such as groupByKey and reduceByKey, it uses the largest parent RDD’s number of partitions. You can pass the level of parallelism as a second argument...." Does this explain the difference between partitions and slices? That is, are partitions related to RDD storage and slices related to the degree of parallelism, with the number of slices calculated by default from either the data size or the number of partitions?

Solution

They are the same thing. The documentation has been fixed for Spark 1.2 thanks to Matthew Farrellee. More details in the bug: https://issues.apache.org/jira/browse/SPARK-1701
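
To make the equivalence concrete, here is a minimal plain-Python sketch (it does not use Spark, and it is not Spark's actual implementation). The helper names `slice_collection` and `run_job` are hypothetical, chosen only for illustration. The idea is that the number of slices requested when a collection is distributed is simply the number of partitions the resulting dataset has, and a job then runs one task per partition.

```python
def slice_collection(data, num_slices):
    """Split `data` into `num_slices` contiguous chunks of near-equal size.

    Hypothetical helper: each returned chunk plays the role of one
    partition (a.k.a. slice) of a distributed collection.
    """
    if num_slices < 1:
        raise ValueError("num_slices must be positive")
    n = len(data)
    # Chunk i covers indices [i*n // num_slices, (i+1)*n // num_slices).
    return [data[i * n // num_slices:(i + 1) * n // num_slices]
            for i in range(num_slices)]


def run_job(partitions, task):
    """Run one task per partition and collect the per-partition results."""
    return [task(p) for p in partitions]


# Asking for 4 slices yields a dataset with 4 partitions.
partitions = slice_collection(list(range(10)), 4)

# One task (here: a partial sum) runs per partition.
partial_sums = run_job(partitions, sum)

print(len(partitions))    # number of partitions == number of slices requested
print(partitions)
print(sum(partial_sums))  # combining the partial results gives the total
```

In PySpark the corresponding check would be along the lines of calling `sc.parallelize(data, numSlices)` and then `getNumPartitions()` on the result, which should report the same number that was passed in as the slice count.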

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow