Spark join operation based on two columns

Question 1

    val emp = sc.
      textFile("emp.txt").
      map { line =>
        val parts = line.split("\t")
        // Key each record by the two join columns (fields 0 and 2);
        // the remaining field (1) becomes the value.
        ((parts(0), parts(2)),parts(1))
      }

    val emp_new = sc.
      textFile("emp_new.txt").
      map { line =>
        val parts = line.split("\t")
        // Key each record by the two join columns (fields 0 and 2);
        // the remaining field (1) becomes the value.
        ((parts(0), parts(2)),parts(1))
      }

    val finalemp =
      emp_new.join(emp).
      map { case ((nk1, nk2), (newVal, oldVal)) => (nk1, newVal, oldVal) }

Question 2

If you look at the signature of join, you will see that it works on an RDD of pairs:

    def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))]

You have a triple. I guess you're trying to join on the first two elements of the tuple, so you need to map each triple to a pair whose first element is itself a pair containing the first two elements of the triple. For example, for any types V1 and V2:

    val left: RDD[(String, String, V1)] = ??? // some rdd

    val right: RDD[(String, String, V2)] = ??? // some rdd

    left.map {
      case (key1, key2, value) => ((key1, key2), value)
    }
    .join(
      right.map {
        case (key1, key2, value) => ((key1, key2), value)
      })

This will give you an RDD of type RDD[((String, String), (V1, V2))].
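The re-keying step can be tried end to end with a minimal sketch like the one below. The sample records, the object name CompositeKeyJoinSketch, the helper joinOnBothKeys, and the concrete value types (Int and String) are all made up for illustration; it assumes Spark is on the classpath and runs in local mode.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object CompositeKeyJoinSketch {
  // Hypothetical helper: re-key both triples as ((key1, key2), value),
  // then use the ordinary pair-RDD join on the composite key.
  def joinOnBothKeys(
      left: RDD[(String, String, Int)],
      right: RDD[(String, String, String)]
  ): RDD[((String, String), (Int, String))] =
    left.map { case (key1, key2, value) => ((key1, key2), value) }
      .join(right.map { case (key1, key2, value) => ((key1, key2), value) })

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("composite-key-join")
    val sc = SparkContext.getOrCreate(conf)

    // Made-up sample data: (key1, key2, value)
    val left = sc.parallelize(Seq(("a", "x", 1), ("a", "y", 2), ("b", "x", 3)))
    val right = sc.parallelize(Seq(("a", "x", "one"), ("b", "x", "three"), ("c", "z", "none")))

    // Only records whose two keys both match survive the inner join:
    // ((a,x),(1,one))
    // ((b,x),(3,three))
    joinOnBothKeys(left, right).collect().sortBy(_._1).foreach(println)

    sc.stop()
  }
}
```

Records such as ("a", "y", 2) are dropped because join is an inner join; leftOuterJoin, rightOuterJoin, or fullOuterJoin on the same keyed RDDs would keep them.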

Question 3

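rdd1 schema: field1, field2, field3, fieldX, ...

rdd2 schema: field1, field2, field3, fieldY, ...

Note that join with a Seq of column names and a join type is part of the DataFrame/Dataset API, so despite their names, rdd1 and rdd2 must be DataFrames here, not RDDs. The join type "outer" is a full outer join.

    val joinResult = rdd1.join(rdd2, Seq("field1", "field2", "field3"), "outer")

joinResult schema: field1, field2, field3, fieldX, fieldY, ...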
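As a rough illustration of the column-name join, the sketch below builds two small DataFrames and joins them on three shared columns. The sample rows, the object name MultiColumnJoinSketch, and the helper outerJoinOnFields are assumptions made for this example; only the field names come from the schemas above. It assumes spark-sql is on the classpath and runs in local mode.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object MultiColumnJoinSketch {
  // Hypothetical helper: full outer join on the three shared columns.
  // Passing the column names as a Seq keeps a single copy of each join column.
  def outerJoinOnFields(df1: DataFrame, df2: DataFrame): DataFrame =
    df1.join(df2, Seq("field1", "field2", "field3"), "outer")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("multi-column-join")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample rows matching the two schemas
    val df1 = Seq(("a", "b", "c", 1), ("d", "e", "f", 2))
      .toDF("field1", "field2", "field3", "fieldX")
    val df2 = Seq(("a", "b", "c", "y1"), ("g", "h", "i", "y2"))
      .toDF("field1", "field2", "field3", "fieldY")

    val joinResult = outerJoinOnFields(df1, df2)
    joinResult.printSchema() // field1, field2, field3, fieldX, fieldY
    joinResult.show()        // unmatched rows carry null in the other side's column

    spark.stop()
  }
}
```

If you start from RDDs, they would first need to be converted to DataFrames (for example with toDF after importing spark.implicits._) before this form of join is available.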