Question

I have noticed that if you create an external table with the EXTERNAL keyword, pointing to some S3 bucket location, the data gets loaded and you can query it. But even if I leave out the EXTERNAL keyword and keep the rest of the script for creating the table, it still works perfectly. Why is that? Also, is it then an external table or an internal table? If I delete this table, will only the metadata be deleted, or the data as well? Is there any significance to the term EXTERNAL?

If I create one table with the EXTERNAL keyword and a location, and another table without the EXTERNAL keyword but with a location, what is the difference, given that I am getting the same behaviour?

create table dummy(id int, value string)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '~'
    STORED AS TEXTFILE
    LOCATION 's3n://logs/july';

VS

create external table dummy(id int, value string)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '~'
    STORED AS TEXTFILE
    LOCATION 's3n://logs/july';

Solution

When you use LOCATION with a managed table, its purpose is to give the table a dedicated directory at the path you specify, instead of the directory it would otherwise get under the default location, /user/hive/warehouse/. So, when you do:

create table dummy(id int, value string)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '~'
    STORED AS TEXTFILE
    LOCATION 's3n://logs/july';

It will create a new directory at s3n://logs/july for the table dummy instead of creating a directory named dummy inside /user/hive/warehouse/.

The same thing happens if you use LOCATION with an external table and point it at a location that does not exist yet. If the directory you give while creating an external table already exists, all the files inside it collectively constitute the data of the table. If the directory does not exist, you will see the same behavior as with managed tables, i.e. the table gets created with an empty directory at the location you specified in the create command. Say you do:

create external table dummy(id int, value string)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '~'
    STORED AS TEXTFILE
    LOCATION 's3n://logs/july';

If s3n://logs/july exists, the table dummy is created with the data from the files present under s3n://logs/july. If s3n://logs/july doesn't exist, then immediately after issuing the create command you will see a brand new directory at that location. And if you delete this table, the directory s3n://logs/july remains as it is, though empty (because you pointed the table at a non-existing location when creating it).

If you delete a managed table, the directory given in the LOCATION clause gets deleted along with it, even though you chose that location yourself. I don't know how you are getting the same behavior in both cases. Try this and let me know what you observe.

BTW, SO is a place to share knowledge and thoughts in good spirit, not to get worked up or take things personally. If you disagree with something, there is a proper way to make your point. And if it was just about a downvote, just let me know. I'll upvote all your questions and answers.
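One way to run that check, sketched here under the assumption that you are in the Hive CLI and working with the table dummy from above: DESCRIBE FORMATTED reports the table type, and listing the location after DROP TABLE shows whether the directory survived.

```sql
-- Look for the "Table Type:" row in the output:
-- MANAGED_TABLE for a plain CREATE TABLE, EXTERNAL_TABLE for CREATE EXTERNAL TABLE.
DESCRIBE FORMATTED dummy;

-- Drop the table, then inspect the location used in the create statement.
DROP TABLE dummy;

-- Hive CLI shortcut for a filesystem listing
-- (assumes the CLI session can reach the bucket).
dfs -ls s3n://logs/july;
```

If the listing fails because the path is gone, the table was managed; if the directory is still there, it was external.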

OTHER TIPS

External means the folder path is not part of the Hive warehouse (/user/hive/warehouse).

When you use the DROP command, an external table drops only the schema, whereas a managed table drops the data as well.

An external table is useful for making daily transactional data available to SQL queries without any extra effort.
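If a table was created without the EXTERNAL keyword and you want its data to survive a drop, Hive exposes the table type as a table property. A minimal sketch, assuming the table dummy from the question and a Hive version that permits this conversion for the table in question:

```sql
-- Mark the managed table as external. The value is written in upper case
-- because some Hive versions compare it case-sensitively.
ALTER TABLE dummy SET TBLPROPERTIES ('EXTERNAL' = 'TRUE');

-- From here on, dropping the table removes only the metadata.
DROP TABLE dummy;
```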
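As a rough illustration of that use case, the sketch below registers one day's directory as a partition of an external table. The table name daily_logs, the partition column dt, the date, and the paths under s3n://logs/daily are all hypothetical and only stand in for whatever layout your upstream process writes.

```sql
-- Hypothetical table: one partition per day, each pointing at a directory
-- that some other process fills with '~'-delimited text files.
CREATE EXTERNAL TABLE daily_logs (id INT, value STRING)
    PARTITIONED BY (dt STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '~'
    STORED AS TEXTFILE
    LOCATION 's3n://logs/daily';

-- Register one day's directory; no data is copied or moved.
ALTER TABLE daily_logs ADD IF NOT EXISTS PARTITION (dt = '2013-07-01')
    LOCATION 's3n://logs/daily/2013-07-01';

-- Files already present under that directory can be queried right away.
SELECT COUNT(*) FROM daily_logs WHERE dt = '2013-07-01';
```

Because the table is external, dropping it later leaves the daily directories in place for whatever else reads them.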

read this also

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow