Question

I have three CSV files in one directory, and each file has a different schema. I need to create one Hive table that combines the data from all three files.

The schema of each file is as follows:

/example/test1.csv -- C1, C2, C3, C4
/example/test2.csv -- C1, C2, C3, C4, C5
/example/test3.csv -- C1, C2, C6, C3, C4, C5

Can I create one Hive table with the schema C1, C2, C3, C4, C5, C6?


Solution

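I think you want a view that is a union of the different tables, if I understand correctly. @venBigData's solution doesn't quite work because test3.csv has C6 in the third position, so a single table that reads every file by position will interpret its C6 values as c3.

Something like: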

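CREATE VIEW union_view
AS
SELECT c1, c2, c3, c4, NULL AS c5, NULL AS c6
  FROM test1
UNION ALL
SELECT c1, c2, c3, c4, c5, NULL AS c6
  FROM test2
UNION ALL
-- Assumes test3's columns are named c1..c6 by position,
-- so its c3 holds the C6 values, c4 holds C3, and so on.
SELECT c1, c2, c4 AS c3, c5 AS c4, c6 AS c5, c3 AS c6
  FROM test3;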

Is that what you meant?
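
The view above needs test1, test2, and test3 to exist as separate tables. A minimal sketch of how they might be declared follows. It assumes each file has first been moved into its own directory, since a Hive table's LOCATION points to a directory rather than to a single file. The directory paths and the INT column types are assumptions for illustration, not something stated in the question.

```sql
-- Assumption: test1.csv, test2.csv and test3.csv each live in their own
-- directory (hypothetical paths), because LOCATION names a directory.
CREATE EXTERNAL TABLE test1 (c1 INT, c2 INT, c3 INT, c4 INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/example/test1';

CREATE EXTERNAL TABLE test2 (c1 INT, c2 INT, c3 INT, c4 INT, c5 INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/example/test2';

-- Columns are named by position to match the aliases used in union_view:
-- in test3.csv the third field is C6, so here it is called c3.
CREATE EXTERNAL TABLE test3 (c1 INT, c2 INT, c3 INT, c4 INT, c5 INT, c6 INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/example/test3';
```

If test3 were instead declared with its real column order (c1, c2, c6, c3, c4, c5), the last branch of the view could select the columns by name without any renaming.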

OTHER TIPS

Note: Hive is schema-on-read, so the table definition is not checked against the files when the table is created. You could create a table like:

CREATE EXTERNAL TABLE tab3 (
  c1 INT,
  c2 INT,
  c3 INT,
  c4 INT,
  c5 INT,
  c6 INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/example';

This table gets created with test1.csv, test2.csv, and test3.csv as its data sources.

When you query the table, note that fields are matched by position, not by name: C3 in test1.csv, C3 in test2.csv, and C6 in test3.csv all end up in the same column (c3) of the Hive table. Similarly, C4, C4, and C3 all end up in c4, and so on.

Is that what you were looking for? Columns are filled with NULL for files that have no data in that position. For example, c5 and c6 will be NULL for test1.csv, and c6 will be NULL for test2.csv.
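
If the files have to stay in one directory, one possible way to undo the positional mix-up is to branch on Hive's INPUT__FILE__NAME virtual column, which holds the path of the file each row was read from. The query below is only a sketch against the tab3 table defined above. The LIKE pattern assumes the file is still named test3.csv.

```sql
-- Sketch: realign test3.csv rows, whose fields arrive as C1, C2, C6, C3, C4, C5.
-- Rows from test1.csv and test2.csv already line up by position.
SELECT
  c1,
  c2,
  CASE WHEN INPUT__FILE__NAME LIKE '%test3.csv' THEN c4 ELSE c3 END AS c3,
  CASE WHEN INPUT__FILE__NAME LIKE '%test3.csv' THEN c5 ELSE c4 END AS c4,
  CASE WHEN INPUT__FILE__NAME LIKE '%test3.csv' THEN c6 ELSE c5 END AS c5,
  -- Only test3.csv carries C6, and it sits in the third position (c3).
  CASE WHEN INPUT__FILE__NAME LIKE '%test3.csv' THEN c3 ELSE NULL END AS c6
FROM tab3;
```

This keeps a single external table over /example, at the cost of hard-coding a file name into the query, so it would need updating if more files with other layouts were added.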

Licensed under: CC-BY-SA with attribution