Optimising a Hive query with a JOIN on tables with millions of records

https://stackoverflow.com/questions/23492638

hadoop
hive

16-07-2023

Question

I have two tables:

bpm_agent_data - 40 million records, 5 columns
bpm_loan_data - 20 million records, 5 columns

I ran this query in Hive:

select count(bpm_agent_data.AgentID), count(bpm_loan_data.LoanNumber) from bpm_agent_data JOIN bpm_loan_data where bpm_loan_data.id = bpm_agent_data.id;

It takes a very long time to complete. What is the ideal way to write this query in Hive so that the reducer does not take so long?

Solution

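I found the solution: replace WHERE with ON, so that the condition on id is part of the JOIN clause itself. With the condition only in WHERE, Hive (at least in older versions) can plan the join as a cross product of the two tables and apply the filter afterwards, which would explain the long reducer time.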

select count(bpm_agent_data.AgentID), count(bpm_loan_data.LoanNumber) from bpm_agent_data JOIN bpm_loan_data ON( bpm_loan_data.id = bpm_agent_data.id);
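
One way to check whether the condition is actually used as a join key is to compare the plans of the two forms with EXPLAIN. This is only a sketch: the aliases a and l are introduced here for readability, and the exact plan output depends on the Hive version and execution engine.

```sql
-- Original form: the condition appears only in WHERE.
EXPLAIN
SELECT count(a.AgentID), count(l.LoanNumber)
FROM bpm_agent_data a
JOIN bpm_loan_data l
WHERE l.id = a.id;

-- Rewritten form: the condition is part of the JOIN clause.
EXPLAIN
SELECT count(a.AgentID), count(l.LoanNumber)
FROM bpm_agent_data a
JOIN bpm_loan_data l ON (l.id = a.id);
```

In the output, the thing to look for is whether the join operator lists id as a join key for both tables. If the first plan shows a join without keys followed by a filter, the WHERE form is being run as a cross product.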

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow