Question

I have two Hive queries:

  1. `select * from tab1 limit 3;`

    This returns the 3 rows quickly without launching any MapReduce job.

  2. The same query, but writing the output to a local directory:

    `INSERT OVERWRITE LOCAL DIRECTORY "/tmp/query1/" select * from tab1 limit 3;`

    This query launches a MapReduce job that scans through all the files of the table before returning 3 rows. The table in question is a big one, so scanning the whole thing takes a long time.

Why are the two queries executed so differently?

Solution

A simple explanation is:

When you run a simple `select * from tab1 limit 3` query, Hive reads the raw data files directly from HDFS and presents the output like a view on top of those files, essentially the equivalent of `dfs -cat 'filepath'`. No MapReduce job is triggered in this case, so the query finishes faster. If you modify the query to pull even a single column, as in `select col1 from tab1 limit 3`, a MapReduce job is triggered and the part files are scanned in parallel to produce the result, which consumes some cumulative CPU time.

The same thing happens when you run a query like `INSERT OVERWRITE LOCAL DIRECTORY "/tmp/query1/" select * from tab1 limit 3;`: a MapReduce job is launched instead of a direct read.

To find out more about how Hive translates queries into MapReduce jobs, put the EXPLAIN keyword at the start of your statement. This should make things clearer.
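As a rough sketch, the two statements from the question could be compared like this. The table name and directory are taken from the question. The `SET` line is an addition that is not part of the original answer: it only prints the current value of `hive.fetch.task.conversion`, the property that controls which simple queries Hive is allowed to serve with a plain fetch. The exact plan text varies between Hive versions, so treat this as a starting point, not a definitive recipe.

```sql
-- Print the current fetch-task setting (this only displays it, nothing is changed).
-- Assumption: your Hive version has this property; its default differs by version.
SET hive.fetch.task.conversion;

-- Plan for the plain query from the question.
EXPLAIN
SELECT * FROM tab1 LIMIT 3;

-- Plan for the variant that writes to a local directory.
EXPLAIN
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/query1/'
SELECT * FROM tab1 LIMIT 3;
```

When reading the output, check the stage list: a plan that consists only of a fetch stage is read straight from the files, while a plan that contains a Map Reduce stage will launch a job.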

License: CC-BY-SA with attribution
Not affiliated with StackOverflow