It seems the best way to do this is simply to write a custom save method and call transform.foreach(x => saveHoweverYouLike(x)).
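As a sketch of that approach (the RDD, the transform, and the bucket name are hypothetical, and this assumes the AWS SDK for Java v1 is available on the executors' classpath):

```scala
import com.amazonaws.services.s3.AmazonS3ClientBuilder

// Hypothetical input: an RDD[(String, String)] of (key, contents) pairs.
val transformed = rdd.mapValues(transform)   // transformation: still lazy

transformed.foreach { case (key, contents) =>
  // Runs on the executors; each record becomes one S3 object.
  // Note: this builds a client per record -- that per-record setup cost
  // is exactly what the foreachPartition() approach amortizes.
  val s3 = AmazonS3ClientBuilder.defaultClient()
  s3.putObject("my-output-bucket", key, contents)
}
```

This is a sketch, not a runnable program: it needs a live SparkContext and S3 credentials to execute.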
Saving S3 files from Apache Spark without using an action
18-07-2023
Question
While learning Spark, it seems that the basic workflow is transformations -> action, with the action producing the final result.
I have a different workflow, in which I'm not interested in the final result of a calculation, but rather wish to populate a large set of Amazon S3 files based on the transformation. (Imagine doing massively parallel image processing.) I'd like to do something like this:
for each (k, v):
    v_ = transform(v)
    make a new S3 file with key = k and contents = v_
Solution 2

Other tips
In addition to the other answer, it may be worth considering RDD.foreachPartition() as well, which lets you process one whole partition at a time. This is beneficial when there is a large setup cost for pushing data out (for example, creating an S3 client or opening a connection pool).
transformedRDD.foreachPartition { iteratorOfRecords =>
  setup() // one-time setup per partition: connections, clients, etc.
  iteratorOfRecords.foreach { keyValue => saveHoweverYouLike(keyValue) }
}
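The once-per-partition setup can be made explicit by factoring the save loop into a small helper that receives the expensive setup as a function. Everything here (names, types, the setup mechanism) is a hypothetical sketch, independent of any real S3 client:

```scala
// Hedged sketch: `setup` stands in for whatever expensive work you do once
// per partition (build an S3 client, open a pool, ...). It returns the
// per-record save function. Returns the number of records pushed out.
def savePartition[K, V](
    records: Iterator[(K, V)],
    setup: () => ((K, V) => Unit)
): Int = {
  val save = setup()   // paid once per partition, not once per record
  var n = 0
  records.foreach { kv => save(kv); n += 1 }
  n
}

// On the real RDD this would be wired up roughly as:
//   transformedRDD.foreachPartition { it => savePartition(it, mkS3Saver) }
// where mkS3Saver is your (hypothetical) S3-client-building setup function.
```

Separating the logic this way also makes the partition-level behavior easy to test without a cluster.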
One other minor point: technically, foreach() is also an "action", even though it doesn't return a result. And you have to perform an action to force Spark to initiate the lazy evaluation of the RDDs.
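The same laziness can be seen with plain Scala iterators; this is only an analogy for Spark's behavior, but it runs without a cluster:

```scala
// Plain-Scala analogy of Spark's lazy evaluation (no Spark needed).
var ran = 0
val data = (1 to 3).iterator.map { x => ran += 1; x * 2 } // like a transformation: lazy
val before = ran        // still 0 -- mapping the iterator ran nothing yet
data.foreach(_ => ())   // like an action: forces the pipeline to execute
val after = ran         // now 3 -- the map side effect ran once per element
```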