Hadoop Hive: How to Select FROM tableX m1, tableX m2?

https://stackoverflow.com/questions/23339199

10-07-2023

Question

How can I write a Hive SQL query that, for a table with two fields id and val, returns groups of ids in which all ids share the same val?

The following query:

SELECT DISTINCT m1.id,
  m2.id
FROM tableX m1,
  tableX m2
WHERE m1.id <> m2.id
AND m1.val   = m2.val;

fails with:

FAILED: ParseException line 1:42 cannot recognize input near 'm1' ',' 'match' in table source

Solution

If you just want to look at those groups, you can simply write:

select id, val
from tableX
order by val;

If you need those groups as separate objects, you can use something like:

select val, collect_set(id)
from tableX
group by val;

Or, if val itself is not important:

select collect_set(id)
from tableX
group by val;

These queries produce arrays of ids with no duplicates. collect_list (available as of Hive 0.13) aggregates into arrays that keep duplicates.
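
As a rough sketch of the difference, assuming the same tableX(id, val) from the question, the two aggregates can be put side by side. The column aliases and the HAVING filter are illustrative additions, not part of the original answer. The filter keeps only vals shared by at least two different ids, which mirrors the m1.id <> m2.id condition in the question. collect_list needs Hive 0.13 or later.

```sql
-- Assumes tableX(id, val) as described in the question.
-- distinct_ids and all_ids are illustrative aliases.
SELECT val,
       collect_set(id)  AS distinct_ids,  -- each id appears at most once
       collect_list(id) AS all_ids        -- one entry per row, duplicates kept
FROM tableX
GROUP BY val
HAVING MIN(id) <> MAX(id);                -- keep vals shared by 2+ different ids
```

The order of the elements inside the returned arrays should not be relied on.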

OTHER TIPS

Use aggregation, not a join, for this. The following returns each val that is associated with exactly one distinct id:

SELECT m.val
FROM tableX m
GROUP BY m.val
HAVING MIN(m.id) = MAX(m.id);

The HAVING clause could also be:

HAVING COUNT(DISTINCT id) = 1

But COUNT(DISTINCT) is usually more computationally intensive than MIN()/MAX().
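
Returning to the query in the question: if the pairwise output is what is actually needed, a self-join with explicit JOIN syntax may avoid the parse error. This is only a sketch, under the assumption that the failure comes from the comma-separated table list, which to my knowledge Hive accepts only from version 0.13 on. The inequality stays in WHERE rather than ON, because older Hive versions accept only equality conditions in ON. The aliases id1 and id2 are illustrative.

```sql
-- Assumes tableX(id, val) as described in the question.
SELECT DISTINCT
       m1.id AS id1,
       m2.id AS id2
FROM tableX m1
JOIN tableX m2
  ON (m1.val = m2.val)   -- equality condition only in ON
WHERE m1.id <> m2.id;    -- use m1.id < m2.id to list each pair once
```

With <> every pair is returned in both orders. On large tables the aggregation approaches above are likely to be cheaper than the self-join.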

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow