Do records with the same key in two RDDs repartitioned by key reside in the same node in spark?

https://datascience.stackexchange.com/questions/15327

machine-learning
bigdata
parallel
apache-spark

16-10-2019
|

Question

I have two RDDs named "data" and "model", they are repartitioned by key described as below :

Does the tuple records with the same key reside in the same node in my cluster ?

Should it save IO cost in shuffle operation, such as "data.cogroup(model)" , if it comes true ?

Solution

The tuple of one partition is always on the same node because a partition itself is impartible. So if you do a groupBy or write your own partitioner which partitions by key, all records with the same key/partition number will be shuffled to the same node.

Otherwise, transformations like mapPartition which pass an iterator to a user defined function wouldn't work.

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange