A simple explanation is:
When you execute a simple select * from tab1 limit 3 query in Hive, it accesses the raw data files in HDFS directly and presents the output like a view on top of those files, basically the equivalent of dfs -cat 'filepath'. No Map Reduce job is triggered in this case, so the query completes faster. If you modify the query to pull even a single column, like select col1 from tab1 limit 3, a Map Reduce job is triggered and the part files are scanned in parallel to pull out the results, which consumes some Cumulative CPU Time.
The same thing happens (a Map Reduce job is triggered) when you run a query like INSERT OVERWRITE LOCAL DIRECTORY "/tmp/query1/" select * from tab1 limit 3;
To find out more about how Hive translates queries into Map Reduce jobs, you can put the EXPLAIN keyword in front of your query, that is, before the SELECT keyword. This should make things clearer.
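
As a rough sketch, you could compare the plans of the three queries discussed above like this. It reuses tab1 and col1 from the question. What the plans contain depends on your Hive version and settings, so treat the comments as things to look for rather than guaranteed output. In particular, the hive.fetch.task.conversion setting (if your version has it) controls which simple queries are answered by a plain fetch, and newer releases may run the single-column query without Map Reduce as well.

```sql
-- Plain fetch: the plan is expected to contain only a fetch stage,
-- with no Map Reduce stage.
EXPLAIN select * from tab1 limit 3;

-- Single-column projection: check whether the plan now lists a
-- Map Reduce stage in addition to the fetch stage.
EXPLAIN select col1 from tab1 limit 3;

-- Writing to a local directory: the plan should show the extra work
-- needed to produce and move the output files.
EXPLAIN INSERT OVERWRITE LOCAL DIRECTORY "/tmp/query1/" select * from tab1 limit 3;

-- Assumption: your Hive version supports this property. Running SET with
-- only the property name prints its current value.
SET hive.fetch.task.conversion;
```

Reading the stage list at the top of each plan first is usually the quickest way to see whether a Map Reduce job is involved.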