Loading data into Hive/Impala

https://stackoverflow.com/questions/23117410

04-07-2023

Question

What is the best method to load incremental data into a Hive/Impala table?

I followed the steps below but couldn't get it to work.

1. The MR program writes its output to a temp folder in HDFS.
2. Define an external temp table pointing to the temp folder (the MR output folder).
3. Run a load statement from the temp table into my target table.
4. Drop the temp table and delete the temp folder in HDFS.

The above method works fine when I run the Oozie workflow sequentially. When I invoke multiple jobs at the same time, it hangs while loading the data.

I can't run the data loads in sequence. Any help in making this more effective, so that I can run parallel jobs that load data at the same time, would be appreciated.

Solution

In our case, the incremental data goes into a new partition of the Hive table every time. So, in step 3 of the steps above (the load from the temp table into the target table), we simply add a new partition to the table instead.

When multiple workflows run in parallel, this should work fine as long as each of them loads its data into its own new partition.
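
As a rough illustration of the partition approach, here is a minimal Python sketch that builds the `ALTER TABLE ... ADD PARTITION` statement each workflow could run. The table name `events`, the partition column `load_id`, and the HDFS path are hypothetical placeholders, not details from the question, and the sketch assumes the target table is already partitioned on that column and that the MR output is written directly to the partition's directory.

```python
def add_partition_statement(table, partition_column, partition_value, location):
    """Build a HiveQL statement that registers an HDFS directory as a new partition.

    Assumes all arguments come from trusted workflow configuration;
    no quoting or escaping is applied to them.
    """
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION ({partition_column}='{partition_value}') "
        f"LOCATION '{location}'"
    )


# Hypothetical values: each parallel workflow would use its own
# partition value and its own output directory.
statement = add_partition_statement(
    table="events",
    partition_column="load_id",
    partition_value="job_42",
    location="/data/events/load_id=job_42",
)
print(statement)
```

Each workflow would pass the resulting statement to Hive, for example from an Oozie Hive action. If the table is also queried through Impala, a `REFRESH` or `INVALIDATE METADATA` on the table is typically needed afterwards so that Impala sees the new partition.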

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow