Question

How can I write a Hive SQL query that, for a table with two fields id and val, returns groups of ids in which every id in a group has the same val?

The following query:

SELECT DISTINCT m1.id,
  m2.id
FROM tableX m1,
  tableX m2
WHERE m1.id <> m2.id
AND m1.val   = m2.val; 

fails with:

FAILED: ParseException line 1:42 cannot recognize input near 'm1' ',' 'match' in table source

Solution

If you just want to look at those groups, you can simply write:

select id, val
from tableX
order by val;

If you need those groups as separate objects, you can use something like:

select val, collect_set(id)
from tableX
group by val;

Or, if val itself is not needed in the output:

select collect_set(id)
from tableX
group by val;

Both collect_set queries produce one array of ids per val, with duplicates removed. collect_list (available as of Hive 0.13) aggregates into an array that keeps duplicates.
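
To make the difference between the two functions concrete, here is a small sketch. The sample rows in the comments are hypothetical, the column aliases are illustrative, and the order of elements inside each array is not guaranteed.

```sql
-- Hypothetical sample rows in tableX (id, val):
--   (1, 'a'), (2, 'a'), (2, 'a'), (3, 'b')
SELECT val,
       collect_set(id)  AS distinct_ids,  -- for 'a': contains 1 and 2, each once
       collect_list(id) AS all_ids        -- for 'a': contains 1, 2 and 2 (needs Hive 0.13+)
FROM tableX
GROUP BY val;
```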

OTHER TIPS

Use aggregation, not a join, for this:

SELECT m.val
FROM tableX m
GROUP BY m.val
HAVING MIN(m.id) = MAX(m.id);
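
As a rough illustration of what this condition selects, assume tableX holds the hypothetical rows listed in the comments below. MIN(id) = MAX(id) holds only when a val has a single distinct id, so flipping the comparison gives the vals that are shared by several ids.

```sql
-- Hypothetical sample rows in tableX (id, val):
--   (1, 'a'), (2, 'a'), (3, 'b')

-- vals carried by exactly one id: returns 'b' for the rows above
SELECT m.val
FROM tableX m
GROUP BY m.val
HAVING MIN(m.id) = MAX(m.id);

-- vals shared by two or more ids: returns 'a' for the rows above
SELECT m.val
FROM tableX m
GROUP BY m.val
HAVING MIN(m.id) <> MAX(m.id);
```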

The HAVING clause could also be:

HAVING COUNT(DISTINCT id) = 1

But COUNT(DISTINCT) is usually more computationally intensive than MIN()/MAX().
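
If the pairwise output of the original query is still wanted, the parse error most likely comes from the comma-separated table list in FROM: implicit join notation is only accepted from Hive 0.13 on. Below is a sketch with an explicit JOIN, assuming the same tableX(id, val) layout; the aliases id1 and id2 are illustrative. The inequality stays in WHERE because older Hive releases accept only equality conditions in ON.

```sql
-- Pairs of different ids that share the same val.
SELECT DISTINCT m1.id AS id1,
       m2.id AS id2
FROM tableX m1
JOIN tableX m2
  ON m1.val = m2.val      -- equi-join condition
WHERE m1.id <> m2.id;     -- drop pairs of an id with itself
```

Replacing m1.id <> m2.id with m1.id < m2.id would list each pair once instead of in both orders.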

Licensed under: CC-BY-SA with attribution