Question

I have some structured data with 10 fields in a Hadoop cluster. I need to load this data into multiple Hive tables, each holding a subset of the fields present in the original data.

E.g.: let's say the data (table A) has fields F1, F2, F3 ... F10. I have table B1 containing fields F1 and F2, table B2 containing F1 and F3, and table B3 containing F2 and F4.

I would like to know if there is a way to populate tables B1, B2, and B3 without reading the data in A multiple times.

FYI: I am de-normalizing the data by splitting it into multiple tables because these tables will have millions of records, and I have to aggregate them later to generate some statistics.
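
For example, the later aggregation might look something like the query below; the specific statistic (a count per F1 from B1) is only illustrative:

-- Illustrative aggregation over one de-normalized table;
-- the grouping column and the COUNT are placeholder choices.
SELECT F1, COUNT(*) AS occurrences
FROM B1
GROUP BY F1;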

Currently, I am thinking of the approach below: read the data using a MapReduce job, write the different field combinations to multiple files using a multiple-output format, and then load the data from these files into the Hive tables. Please suggest a better way to do this if there is one.

Solution

Hive has multi-table inserts:

FROM A
INSERT INTO TABLE B1 SELECT F1, F2
INSERT INTO TABLE B2 SELECT F1, F3
INSERT INTO TABLE B3 SELECT F2, F4;
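
This scans A only once and writes each projection out in the same job. A slightly fuller sketch, assuming the tables already exist with the columns described in the question (the WHERE filter is hypothetical and just shows that each branch can be qualified independently):

-- One scan of A feeds all three tables; INSERT OVERWRITE replaces
-- existing rows while INSERT INTO appends. The filter is hypothetical.
FROM A
INSERT OVERWRITE TABLE B1 SELECT F1, F2 WHERE F1 IS NOT NULL
INSERT OVERWRITE TABLE B2 SELECT F1, F3
INSERT INTO TABLE B3 SELECT F2, F4;

Either form avoids the separate MapReduce job and the intermediate files entirely.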