Question

This is type of query I am sending to HIVE:

SELECT BigTable.nicefield,LargeTable.* 
FROM LargeTable INNER JOIN BigTable 
    ON (
        LargeTable.joinfield1of4 = BigTable.joinfield1of4 
        AND LargeTable.joinfield2of4 = BigTable.joinfield2of4 
    )   
WHERE LargeTable.joinfield3of4=20140726 AND LargeTable.joinfield4of4=15 AND BigTable.joinfield3of4=20140726 AND BigTable.joinfield4of4=15
    AND LargeTable.filterfiled1of2=123456
    AND LargeTable.filterfiled2of2=98765
    AND LargeTable.joinfield2of4=12 
    AND LargeTable.joinfield1of4='iwanttolikehive'       

It returns 2418025 rows. The issue is that

SELECT *  
FROM LargeTable 
WHERE joinfield3of4=20140726 AND joinfield4of4=15
    AND filterfiled1of2=123456 
    AND filterfiled2of2=98765
    AND joinfield2of4=12 
    AND joinfield1of4='iwanttolikehive'

returns 1555 rows, and so does:

SELECT *  
FROM BigTable 
WHERE joinfield3of4=20140726 AND joinfield4of4=15
    AND joinfield2of4=12 
    AND joinfield1of4='iwanttolikehive'

Note that 1555^2 = 2418025.

Was it helpful?

Solution

It turns out that the correct version of the query should be:

SELECT bt.nicefield,LargeTable.* 
FROM LargeTable INNER JOIN 
    (
    SELECT nicefield, joinfield1of4,joinfield2of4, count(*) as rows
    FROM BigTable
    WHERE joinfield3of4=20140726 ANDjoinfield4of4=15
    GROUP BY nicefield, joinfield1of4,joinfield2of4
    ) bt 
    ON (
        LargeTable.joinfield1of4 = bt.joinfield1of4 
        AND LargeTable.joinfield2of4 = bt.joinfield2of4 
    )   
WHERE LargeTable.joinfield3of4=20140726 AND LargeTable.joinfield4of4=15
    AND LargeTable.filterfiled1of2=123456
    AND LargeTable.filterfiled2of2=98765
    AND LargeTable.joinfield2of4=12 
    AND LargeTable.joinfield1of4='iwanttolikehive'

The issue is that in the original query, the join on the BigTable was returning duplicates.

This is NOT an issue, queries just have to be written carefully!!! I hope this help!

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top