Question

What is the best way to load incremental data into a Hive/Impala table?

I followed the steps below but could not get it to work.

  • The MR program writes its output to a temp folder in HDFS.
  • An external temp table is defined pointing to that temp folder (the output folder of the MR job).
  • A load statement copies the data from the temp table into my target table.
  • The temp table and the temp folder in HDFS are dropped.

This method works fine when I run the Oozie workflows sequentially. When I invoke multiple jobs at the same time, they hang while loading the data.

I can't run the data loads in sequence. How can I make this more effective, so that parallel jobs can load data at the same time?

Solution

In our case, each batch of incremental data goes into a new partition of the Hive table. So in step 3 of the steps listed in the question, we simply add a new partition to the table.

When multiple workflows run in parallel, this should work fine as long as each of them loads its data into its own new partition.
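
As a rough illustration of the idea, here is a minimal Python sketch that builds one `ALTER TABLE ... ADD PARTITION` statement per workflow run. The table name `target_table`, the partition column `load_id`, the run ids and the HDFS paths are all hypothetical; the sketch also assumes each MR job writes directly into its own folder, so that registering the folder as a partition takes the place of the load from a temp table.

```python
def add_partition_sql(table, partition_col, partition_value, location):
    """Build a HiveQL statement that registers an existing HDFS folder
    as a new partition of `table`."""
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION ({partition_col}='{partition_value}') "
        f"LOCATION '{location}'"
    )


# Hypothetical run ids: each parallel workflow gets its own partition value
# and its own output folder, so no two workflows touch the same partition.
runs = ["run_001", "run_002"]

statements = [
    add_partition_sql(
        "target_table",                          # assumed table name
        "load_id",                               # assumed partition column
        run,
        f"/data/target_table/load_id={run}",     # assumed MR output folder
    )
    for run in runs
]

for statement in statements:
    print(statement)
```

Each workflow would then run only its own statement, for example from a Hive action in Oozie. If the table is also queried from Impala, a `REFRESH` of the table is typically needed there before partitions added through Hive become visible.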

Licensed under: CC-BY-SA with attribution