Hive (0.12.0) - Load data into table with partition, buckets and attached index

https://stackoverflow.com/questions/22699327

22-06-2023
|

Question

Using Hive 0.12.0, I am looking to populate a table that is partitioned and uses buckets with data stored on HDFS. I would also like to create an index of this table on a foreign key which I will use a lot when joining tables.

I have a working solution but something tells me it is very inefficient.

Here is what I do: I load my data in a "flat" intermediate table (no partition, no buckets):

LOAD DATA LOCAL INPATH 'myFile' OVERWRITE INTO TABLE my_flat_table;

Then I select the data I need from this flat table and insert it into the final partitioned and bucketed table:

FROM my_flat_table
  INSERT OVERWRITE TABLE final_table
  PARTITION(date)
  SELECT
  col1, col2, col3, to_date(my_date) AS date;

The bucketing was defined earlier when I created my final table:

 CREATE TABLE final_table
   (col1 TYPE1, col2 TYPE2, col3 TYPE3)
   PARTITIONED BY (date DATE)
   CLUSTERED BY (col2) INTO 64 BUCKETS;

And finally, I create the index on the same column I use for bucketing (is that even useful?):

CREATE INDEX final_table_index ON TABLE final_table (col2) AS 'COMPACT';

All of this is obviously really slow, so how would I go about optimizing the loading process?

Thank you

Solution

Whenever I had a similar requirement, I used almost the same approach being used by you as I couldn't find an efficiently working alternative.

However to make the process of Dynamic Partitioning a bit fast, I tried setting few configuration parameters like:

set hive.exec.dynamic.partition.mode=nonstrict; 
set hive.exec.dynamic.partition=true; 
set hive.exec.max.dynamic.partitions = 2000;
set hive.exec.max.dynamic.partitions.pernode = 10000;

I am sure you must be using the first two, and the last two you can set depending on your data size.

You can check out this Configuration Properties page and decide for yourself which parameters might help in making your process fast e.g. increasing number of reducers used.

I can not guarantee that using this approach will save your time but definitely you will make the most out of your cluster set up.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow