Question

I've been trying to run some Hive queries with largish result sets. My normal approach is to submit a job through the WebHCat API and read the results from the resulting stdout file, or to run hive at the console and pipe stdout to a file. However, with large result sets (anything that uses more than one reducer), the stdout is blank or truncated.

My current solution is to create a new table from the results with CREATE TABLE ... AS SELECT, which introduces an extra step and leaves a table to clean up afterwards if I don't want to keep the result set.
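
For example (tmp_results is just a placeholder name):

-- tmp_results is a placeholder: materialize the result set, read it, then drop it
CREATE TABLE tmp_results AS
SELECT ... FROM ...;
DROP TABLE tmp_results;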

Does anyone have a better method for capturing all the results from such a Hive query?


Solution

You can write the data directly to a directory on either HDFS or the local file system, then do what you want with the files. For example, to generate CSV files (the ROW FORMAT clause on INSERT OVERWRITE DIRECTORY requires Hive 0.11 or later):

INSERT OVERWRITE DIRECTORY '/hive/output/folder'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
SELECT ... FROM ...;
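
If you want the output on the local file system of the machine running the query rather than on HDFS, add the LOCAL keyword. A minimal sketch, assuming a hypothetical output path:

INSERT OVERWRITE LOCAL DIRECTORY '/tmp/hive_results'  -- hypothetical local path
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
SELECT ... FROM ...;

Either way, the target directory will contain one file per reducer (000000_0, 000001_0, and so on), so you may need to concatenate them to get a single CSV.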

This is essentially the same as CREATE TABLE ... AS SELECT, but you don't have to clean up the table afterwards. Here's the full documentation:

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Writingdataintothefilesystemfromqueries
