Question

I have two RDDs named "data" and "model", they are repartitioned by key described as below : enter image description here

Does the tuple records with the same key reside in the same node in my cluster ?

Should it save IO cost in shuffle operation, such as "data.cogroup(model)" , if it comes true ?

Was it helpful?

Solution

The tuple of one partition is always on the same node because a partition itself is impartible. So if you do a groupBy or write your own partitioner which partitions by key, all records with the same key/partition number will be shuffled to the same node.

Otherwise, transformations like mapPartition which pass an iterator to a user defined function wouldn't work.

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange
scroll top