Question

Is there a left outer join equivalent in Spark Scala? I understand there is a join operation, which is equivalent to a database inner join.


Solution

Spark's Scala API does support left outer joins. Have a look here: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.api.java.JavaPairRDD

Usage is quite simple:

rdd1.leftOuterJoin(rdd2)
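For context, a minimal self-contained sketch (the local SparkContext setup and the sample data are hypothetical, purely for illustration):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("leftOuterJoinExample").setMaster("local[*]"))

// Both RDDs must be pair RDDs: each element is a (key, value) tuple.
val rdd1 = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
val rdd2 = sc.parallelize(Seq((1, "x"), (3, "y")))

// Result type: RDD[(Int, (String, Option[String]))].
// Keys missing from rdd2 yield None on the right side.
val joined = rdd1.leftOuterJoin(rdd2)

joined.collect().foreach(println)
// (1,(a,Some(x))), (2,(b,None)), (3,(c,Some(y)))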

Other Tips

It is as simple as rdd1.leftOuterJoin(rdd2), but you have to make sure both RDDs are pair RDDs, i.e. every element is a (key, value) tuple; one way to get them into that shape is shown in the sketch below.
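A common way to key plain RDDs is keyBy. A sketch with hypothetical User/Order record types (the types and the join key are assumptions for illustration):

import org.apache.spark.rdd.RDD

// Hypothetical record types, purely for illustration.
case class User(id: Long, name: String)
case class Order(userId: Long, amount: Double)

// keyBy turns a plain RDD into a pair RDD, which makes leftOuterJoin available.
def ordersPerUser(users: RDD[User], orders: RDD[Order]): RDD[(Long, (User, Option[Order]))] = {
  val usersByKey  = users.keyBy(_.id)       // RDD[(Long, User)]
  val ordersByKey = orders.keyBy(_.userId)  // RDD[(Long, Order)]
  usersByKey.leftOuterJoin(ordersByKey)     // users without orders get None
}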

Yes, there is. Have a look at the DStream APIs; they provide left as well as right outer joins.

If you have a stream of some type, let's say 'Record', and you wish to join two streams of records, then you can do it like this:

var res: DStream[(Long, (Record, Option[Record]))] = left.leftOuterJoin(right)

As the API requires, the left and right streams have to be keyed (hash-partitioned): take some attributes from a Record (or derive a key in any other way), compute a hash value from them, and convert each stream into a pair DStream. Both streams must then be of type DStream[(Long, Record)] before you call the join function. (This is just an example; the key type can be something other than Long as well.)
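Putting the above together, a sketch of the keying step plus the join (the Record type and the choice of id as the key are assumptions for illustration):

import org.apache.spark.streaming.dstream.DStream

// Hypothetical record type, matching the 'Record' mentioned above.
case class Record(id: Long, payload: String)

def joinStreams(rawLeft: DStream[Record], rawRight: DStream[Record]): DStream[(Long, (Record, Option[Record]))] = {
  // Key both streams (here simply by id; any derived hash key works the same way).
  val left: DStream[(Long, Record)] = rawLeft.map(r => (r.id, r))
  val right: DStream[(Long, Record)] = rawRight.map(r => (r.id, r))
  left.leftOuterJoin(right)  // right-side misses become None
}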

The Spark SQL / DataFrame API also supports LEFT/RIGHT/FULL outer joins directly:

https://spark.apache.org/docs/latest/sql-programming-guide.html
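For instance, in the DataFrame API the join type is passed as a string. A minimal sketch (the DataFrames and the "id" column are assumed for illustration):

import org.apache.spark.sql.DataFrame

// "left_outer" is the join-type string; "right_outer" and "full_outer" work the same way.
def leftJoinById(df1: DataFrame, df2: DataFrame): DataFrame =
  df1.join(df2, df1("id") === df2("id"), "left_outer")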

Because of this bug: https://issues.apache.org/jira/browse/SPARK-11111 outer joins in Spark prior to 1.6 could be very slow (unless the data sets being joined are really small). Before 1.6, Spark evaluated them as a Cartesian product followed by a filter; since 1.6 it uses a SortMergeJoin instead.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow