I think the amount of parallelism you can exploit decreases as the algorithm runs: every round merges components, so there are fewer independent searches left to hand out.
You could try to parallelize the search for the lightest edge leaving each component and leave the union step of the algorithm to a single machine. That machine would then distribute the components to the sub-machines at the start of each round.
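
To make the idea concrete, here is a minimal sketch. It assumes the algorithm in question is a Borůvka-style minimum spanning tree construction, where the per-component search looks for the lightest outgoing edge. The names `lightest_outgoing_edge` and `boruvka_mst` are made up for illustration, and a thread pool stands in for the sub-machines (in CPython it will not give a real CPU speedup, it only shows the structure).

```python
from concurrent.futures import ThreadPoolExecutor


def lightest_outgoing_edge(component, edges, owner):
    """Worker task: cheapest edge with exactly one endpoint in `component`."""
    best = None
    for w, u, v in edges:
        if (owner[u] == component) != (owner[v] == component):
            # Comparing whole tuples breaks weight ties consistently,
            # which keeps the picked edges free of cycles.
            if best is None or (w, u, v) < best:
                best = (w, u, v)
    return best


def boruvka_mst(num_nodes, edges, workers=4):
    """edges is a list of (weight, u, v); returns the chosen edges."""
    owner = list(range(num_nodes))  # owner[v] = id of v's component
    mst = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while True:
            components = sorted(set(owner))
            if len(components) == 1:
                break
            # Parallel part: one search task per component.
            picks = list(pool.map(
                lambda c: lightest_outgoing_edge(c, edges, owner),
                components))
            # Sequential part: the coordinator alone performs the unions.
            merged = False
            for pick in picks:
                if pick is None:
                    continue
                w, u, v = pick
                a, b = owner[u], owner[v]
                if a == b:  # same edge picked from both sides
                    continue
                owner = [a if o == b else o for o in owner]
                mst.append(pick)
                merged = True
            if not merged:  # graph is disconnected, nothing left to merge
                break
    return mst


if __name__ == "__main__":
    sample = [(1, 0, 1), (2, 1, 2), (3, 2, 3), (4, 0, 3), (5, 0, 2)]
    print(boruvka_mst(4, sample))
```

Note that the number of tasks submitted per round equals the number of remaining components, so the pool gets less work to spread out in each round.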