Since Spark uses the Hadoop FileSystem API to write data to files, this is sort of inevitable. If you do
rdd.saveAsTextFile("foo")
It will be saved as "foo/part-XXXXX", with one part-* file per partition in the RDD you are trying to save. The reason each partition in the RDD is written to a separate file is fault-tolerance. If the task writing the 3rd partition (i.e. to part-00002) fails, Spark simply re-runs the task and overwrites the partially written/corrupted part-00002, with no effect on the other parts. If they all wrote to the same file, it would be much harder to recover from a single task failure.
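To make that concrete, here is a minimal sketch in the Scala Spark shell (the 1-to-100, 3-partition RDD is made up for illustration; the "foo" path is from above):

val rdd = sc.parallelize(1 to 100, 3)   // an RDD with 3 partitions
rdd.saveAsTextFile("foo")
// On disk you end up with (plus, typically, a _SUCCESS marker):
//   foo/part-00000
//   foo/part-00001
//   foo/part-00002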
The part-XXXXX files are usually not a problem if you are going to consume them again in Spark or other Hadoop-based frameworks: since they all use the HDFS API, if you ask them to read "foo", they will read all the part-XXXXX files inside foo as well.
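As a sketch of both points: reading the directory back, and, if you truly need a single file, coalescing to one partition first (only sensible when the output is small; the foo_single path and readBack name are made up):

val readBack = sc.textFile("foo")   // reads all part-XXXXX files under foo
readBack.count()                    // same line count as the saved RDD

rdd.coalesce(1).saveAsTextFile("foo_single")   // writes one foo_single/part-00000

Note that coalesce(1) funnels all the data through a single task, so you trade away the parallelism and per-task fault-tolerance described above.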