Question

I would like to perform a sort-merge join as described in the Hive manual (Bucketed Map Join) using the following options

set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;

Both tables must be bucketed and sorted on the join column. My question is - does the sort have to be global, i.e. the keys in the first bucket are less than the keys in the second bucket, or is it sufficient that each bucket is sorted?

Was it helpful?

Solution

you must define the tables to be CLUSTERED BY same column and SORTED BY same column in the same order INTO same amount of buckets.
Then, you must set the above settings as you had listed AND write the hint /*+MAPJOIN(x)*/ where x is one of the tables.
Also, both tables must be joined AS IS in the join clause and you can't use any in a sub-query before the join because the data wont be buckets and sorted after the sub-query which happens first.
Finally, the join columns must be the ones the tables are bucketed/sorted on.

When you insert the data into the tables you can either use hive.enforce.sorting setting (set to true) or manually write the sort command.
Hive doesn't check that the buckets are actually sorted so if they aren't this might cause wrong results in the output.

Each mapper will read a bucket from the first table and the corresponding bucket from the 2nd and it will perform a Merge-sort join.

To your question - No they don't have to globally sorted.

P.S.
You should issue the EXPLAIN command before running the query and you'll see if hive plans to do a Merge-sort bucket join or no.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top