Is Hive external table data distributed to data nodes in the same way as internal tables?

StackOverflow https://stackoverflow.com/questions/20844914

  •  22-09-2022

Question

I can't find reference information to explain certain details of Hive external tables. When a file located outside the default data warehouse is loaded to an external table (using LOCATION), is the data ingested and distributed among the data nodes as is the case with internal tables -- and the file used as the source remains intact in the file system, which essentially duplicates the data?

Solution

If the data are already in HDFS, there is no duplication. An EXTERNAL table points to any HDFS location for its storage...

Other tips

"External" means external to the default directory that Hive is using to store data (e.g., on Hortonworks it is /apps/hive/warehouse). It doesn't mean it is on the local filesystem - it must be on HDFS, on the same Hadoop cluster that Hive is pointing to.

Since it is HDFS data, Hive queries on it are treated exactly the same as if you had written a MapReduce job operating on that data directly. That is to say, the data is not copied to /apps/hive/warehouse before Hive operates on it. Functionally, the only difference is that if you DROP TABLE an external table, the data is not deleted from HDFS. Other than that, everything works exactly the same for internal and external tables.

An EXTERNAL table points to any HDFS location for its storage, rather than being stored in a folder under the path specified by the configuration property hive.metastore.warehouse.dir (the location of the default database for the warehouse).
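As a minimal sketch of what this looks like in HiveQL: the table name `web_logs`, its columns, the tab-delimited text format, and the path `/data/raw/web_logs` are all hypothetical and chosen only for illustration.

```sql
-- Hypothetical table, columns, and HDFS path, for illustration only.
-- LOCATION points Hive at files that already exist; nothing is copied.
CREATE EXTERNAL TABLE web_logs (
  ts      STRING,
  url     STRING,
  status  INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/raw/web_logs';

-- Shows the table's Location and its Table Type
-- (EXTERNAL_TABLE here, MANAGED_TABLE for an internal table).
DESCRIBE FORMATTED web_logs;
```

The Location reported by DESCRIBE FORMATTED should be the directory given in the LOCATION clause rather than a folder under hive.metastore.warehouse.dir.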

The data is not copied when you create an external table; Hive only records the table's location in its metastore. Likewise, dropping an EXTERNAL table removes only that metadata, and the data in the table is not deleted from the file system.
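One way to check the drop behavior yourself is sketched below, assuming a hypothetical external table named `web_logs` whose files live at the hypothetical path `/data/raw/web_logs`. The `dfs` command is a Hive CLI feature that passes its arguments to the HDFS shell; it may be restricted or unavailable in other clients, in which case the same listing can be done from a regular shell.

```sql
-- Hypothetical external table; this removes only its metastore entry.
DROP TABLE web_logs;

-- Hive CLI passthrough to the HDFS shell (hypothetical path).
-- If the behavior described above holds, the data files are still listed.
dfs -ls /data/raw/web_logs;
```

Running the same two steps against an internal table would be expected to show its warehouse folder removed along with the table.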

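Note: Hive itself does not distribute data among the nodes for either kind of table. That is HDFS's job: it splits each file into blocks, stores those blocks on the data nodes, and replicates each block according to the replication factor. This happens the same way whether the files belong to an internal or an external table.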

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow