Question

While learning Spark, it seems that the basic workflow is transformations -> action, with the action producing the final result.

I have a different workflow, in which I'm not interested in the final result of a calculation, but rather wish to populate a large set of Amazon S3 files based on the transformation. (Imagine doing massively parallel image processing.) I'd like to do something like this:

for each (k, v):
    v_ = transform(v)
    write a new S3 file with key = k and contents = v_

Solution 2

It seems like the best way to do this is simply to write a custom save method and call transformed.foreach(x => saveHoweverYouLike(x))
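Since the save method is left open here, the following is a minimal, runnable sketch of the pattern using a local Seq and local files in place of an RDD and S3. transform, saveHoweverYouLike, and the output directory are all illustrative stand-ins:

```scala
import java.nio.file.{Files, Paths}
import java.nio.charset.StandardCharsets

object ForeachSave {
  // Stand-in transformation; imagine image processing instead.
  def transform(v: String): String = v.toUpperCase

  // Stand-in for an S3 PUT: writes one file per key under outDir.
  def saveHoweverYouLike(outDir: String)(kv: (String, String)): Unit = {
    val (k, v) = kv
    Files.write(Paths.get(outDir, k), v.getBytes(StandardCharsets.UTF_8))
  }

  def run(outDir: String): Unit = {
    val records = Seq("a" -> "hello", "b" -> "world")
    // With a real pair RDD this would be:
    //   rdd.mapValues(transform).foreach(saveHoweverYouLike(outDir))
    records.map { case (k, v) => (k, transform(v)) }
      .foreach(saveHoweverYouLike(outDir))
  }
}
```

On a real cluster the closure passed to foreach runs on the executors, so anything it uses (like an S3 client) must be serializable or created inside the closure.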

OTHER TIPS

In addition to the other answer, it may be worth considering RDD.foreachPartition(), which lets you process a whole partition at a time. This is beneficial when there is a large setup cost (such as opening a connection) for pushing data out.

transformedRDD.foreachPartition { iteratorOfRecords => 
    setup()    // do some initial connection setup, etc.
    iteratorOfRecords.foreach { keyValue => saveHoweverYouLike(keyValue) }
}
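To see why this matters, here is a plain-Scala sketch (no Spark needed) in which a Seq of Seqs stands in for an RDD's partitions; setup and save are hypothetical stand-ins for opening a connection and performing an S3 PUT:

```scala
object PartitionSave {
  var setupCount = 0                        // how many times the expensive setup ran

  def setup(): Unit = setupCount += 1       // stand-in: open an S3/HTTP connection

  def save(kv: (String, String)): Unit = () // stand-in: one S3 PUT per record

  // Each inner Seq plays the role of one partition's iterator.
  def savePartitions(partitions: Seq[Seq[(String, String)]]): Unit =
    partitions.foreach { iteratorOfRecords =>
      setup()                               // paid once per partition, not per record
      iteratorOfRecords.foreach(save)
    }
}
```

With four partitions of 1,000 records each, setup runs four times rather than 4,000.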

One more minor point: technically, foreach() is also an "action", even though it doesn't return a result to the driver. And you need some action to trigger Spark's lazy evaluation of the RDD lineage.
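That laziness can be illustrated without Spark using a Scala Iterator, whose map is also lazy: the transformation closure runs only when foreach consumes the iterator (the counter below is just instrumentation, not part of any API):

```scala
object LazyDemo {
  // Returns (evaluations before foreach, evaluations after foreach).
  def run(): (Int, Int) = {
    var evaluated = 0
    val mapped = Iterator(1, 2, 3).map { x => evaluated += 1; x * 2 }
    val before = evaluated   // still 0: building the map computed nothing
    mapped.foreach(_ => ())  // the "action": forces every element through map
    (before, evaluated)
  }
}
```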

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow