Question

I am new to Hadoop and learning Hive.

In Hadoop definative guide 3rd edition page no. 428 last paragraph

I don't understand below paragraph regarding external table in HIVE.

"A common pattern is to use an external table to access an initial dataset stored in HDFS (created by another process), then use a Hive transform to move the data into a managed Hive table."

Can anybody explain briefly what above phrase says?

Was it helpful?

Solution

Usually the data in the initial dataset is not constructed in the optimal way for queries.
You may want to modify the data (like modifying some columns adding columns, making aggregation etc) and to store it in a specific way (partitions / buckets / sorted etc) so that the queries would benefit from these optimizations.

OTHER TIPS

The key difference between external and managed table in Hive is that data in the external table is not managed by Hive. When you create external table you define HDFS directory for that table and Hive is simply "looking" in it and can get data from it but Hive can't delete or change data in that folder. When you drop external table Hive only deletes metadata from its metastore and data in HDFS remains unchanged.

Managed table basically is a directory in HDFS and it's created and managed by Hive. Even more - all operations for removing/changing partitions/raw data/table in that table MUST be done by Hive otherwise metadata in Hive metastore may become incorrect (e.g. you manually delete partition from HDFS but Hive metastore contains info that partition exists).

In Hadoop definative guide I think author meant that it is a common practice to write MR-job that produces some raw data and keeps it in some folder. Than you create Hive external table which will look into that folder. And than safelly run queries without the risk to drop table etc. In other words - you can do MR job that produces some generic data and than use Hive external table as a source of data for insert into managed tables. It helps you to avoid creating boring similar MR jobs and delegate this task to Hive queries - you create query that takes data from external table, aggregates/processes it how you want and puts the result into managed tables.

Another goal of external table is to use as a source data from remote servers, e.g. in csv format.

There is no reason to move table to managed unless you are going to enable ACID or other features supported only for managed tables.

The list of differences in features supported by managed/external tables may change in future, better use current documentation. Currently these features are:

  • ARCHIVE/UNARCHIVE/TRUNCATE/MERGE/CONCATENATE only work for managed tables
  • DROP deletes data for managed tables while it only deletes metadata for external ones
  • ACID/Transactional only works for managed tables
  • Query Results Caching only works for managed tables
  • Only the RELY constraint is allowed on external tables
  • Some Materialized View features only work on managed tables

You can create both EXTERNAL and MANAGED tables on top of the same location, see this answer with more details and tests: https://stackoverflow.com/a/54038932/2700344

Data structure has nothing in common with external/managed table type. If you want to change structure you do not necessarily need to change table managed/external type

It is also mentioned in the book.

when your table is external table.

 you can use other technologies like PIG,Cascading or Mapreduce to process it .

You can also use multiple schemas for that dataset.

and You can also create data lazily if it is external table.

when you decide that dataset should be used by only Hive,make it hive managed table.
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top