Why are the results of a Hive query split into multiple files?

https://stackoverflow.com/questions/7927050

15-02-2021

Question

I have an Amazon Elastic MapReduce job set up to run this Hive query:

CREATE EXTERNAL TABLE output_dailies (
day string, type string, subType string, product string, productDetails string, 
uniqueUsers int, totalUsers int
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '${OUTPUT}';

INSERT OVERWRITE TABLE output_dailies
SELECT day, type, subType, product, productDetails,
       count(DISTINCT accountId) AS uniqueUsers,
       count(accountId) AS totalUsers
FROM raw_logs
WHERE day = '${QUERY_DATE}'
GROUP BY day, type, subType, product, productDetails;

After the job finishes, the output location, which is configured to be on S3, will contain 5 files with this pattern task_201110280815_0001_r_00000x where x goes from 0 to 4. The files are small, 35 KB each.

Is it possible to instruct Hive to store the results in a single file?

Solution

The files are created by different reduce tasks (the _r_ in each file name marks reducer output), which may run on different nodes. Each task writes its own file; if they all had to append to the same file, that would require a lot of locking and slow the job down.

You can treat the multiple files as one big file by just referring to the directory and all its contents.
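As a small sketch of that idea, using the table from the question: output_dailies has its LOCATION set to the directory '${OUTPUT}', not to an individual file, so a query against the table should read every part file in that directory as one data set. The queries below are illustrative only and assume the table definition shown in the question.

```sql
-- output_dailies points at the directory '${OUTPUT}', so Hive reads all
-- files under it (task_..._r_000000 through task_..._r_000004) together.
SELECT * FROM output_dailies LIMIT 10;

-- Row count across all of the part files combined.
SELECT count(*) FROM output_dailies;
```

Most Hadoop tools take a directory as input in the same way, so downstream jobs usually do not need the results merged first.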

OTHER TIPS

In general, yes, this is achievable, but at the cost of some scalability.

Try using the setting

set mapred.reduce.tasks = 1;

This forces a single reducer, so only one output file is written.
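Placed in the script from the question, this might look like the sketch below. It assumes the setting is issued in the same Hive session, before the INSERT it should affect; the query itself is copied from the question.

```sql
-- Force a single reduce task for the statements that follow in this session.
-- (On Hadoop 2 and later the property is named mapreduce.job.reduces;
-- mapred.reduce.tasks is the older, deprecated name.)
SET mapred.reduce.tasks=1;

INSERT OVERWRITE TABLE output_dailies
SELECT day, type, subType, product, productDetails,
       count(DISTINCT accountId) AS uniqueUsers,
       count(accountId) AS totalUsers
FROM raw_logs
WHERE day = '${QUERY_DATE}'
GROUP BY day, type, subType, product, productDetails;
```

With one reducer, the whole GROUP BY runs in a single task. That is fine for output of this size (about 5 x 35 KB), but it becomes a bottleneck as the data grows.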

Licensed under: CC-BY-SA with attribution
