Question

I would like to perform some transformations only on a subset of a RDD (to make experimenting in REPL faster).

Is it possible?

RDD has take(num: Int): Array[T] method, I think I'd need something similar, but returning RDD[T]

Was it helpful?

Solution

You can use RDD.sample to get an RDD out, not an Array. For example, to sample ~1% without replacement:

val data = ...
data.count
...
res1: Long = 18066983

val sample = data.sample(false, 0.01, System.currentTimeMillis().toInt)
sample.count
...
res3: Long = 180190

The third parameter is a seed, and is thankfully optional in the next Spark version.

OTHER TIPS

RDDs are distributed collections which are materialized on actions only. It is not possible to truncate your RDD to a fixed size, and still get an RDD back (hence RDD.take(n) returns an Array[T], just like collect)

I you want to get similarly sized RDDs regardless of the input size, you can truncate items in each of your partitions - this way you can better control the absolute number of items in resulting RDD. Size of the resulting RDD will depend on spark parallelism.

An example from spark-shell:

import org.apache.spark.rdd.RDD
val numberOfPartitions = 1000

val millionRdd: RDD[Int] = sc.parallelize(1 to 1000000, numberOfPartitions)

val millionRddTruncated: RDD[Int] = rdd.mapPartitions(_.take(10))

val billionRddTruncated: RDD[Int] = sc.parallelize(1 to 1000000000, numberOfPartitions).mapPartitions(_.take(10))

millionRdd.count          // 1000000
millionRddTruncated.count // 10000 = 10 item * 1000 partitions
billionRddTruncated.count // 10000 = 10 item * 1000 partitions

Apparently it's possible to create RDD subset by first using its take method and then passing returned array to SparkContext's makeRDD[T](seq: Seq[T], numSlices: Int = defaultParallelism) which returns new RDD.

This approach seems dodgy to me though. Is there a nicer way?

I always use parallelize function of SparkContext to distribute from Array[T] but it seems makeRDD do the same. It's correct way both of them.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top