Question

I have a 78 GB file on HDFS.

I need to create an Impala external table over it to perform some grouping and aggregation on the data.

Problem: The file contains header lines.

Question: Is there any way to skip the headers while reading the file and run queries only against the rest of the data?

I could copy the file to local storage, remove the headers, and copy the updated file back to HDFS, but that is not feasible because the file is too large.

Please suggest a solution if anyone has an idea.

Any suggestions would be appreciated.

Thanks in advance


Solution

UPDATE and DELETE row operations are not available in Hive/Impala, so you have to simulate the DELETE:

  • Load the data file into a temporary Hive/Impala table
  • Use INSERT INTO or CREATE TABLE AS on the temp table to create the required table, filtering out the header rows as you do so (see the sketch below)
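A minimal sketch of that pattern in Impala SQL follows. The table names (raw_data, clean_data), the columns (id, amount), the tab delimiter, the HDFS path, and the literal 'id' used to recognize header lines are all assumptions, since the real schema isn't shown in the question:

    -- Temporary/staging external table over the existing HDFS directory;
    -- every column is declared as STRING so header lines parse without errors.
    CREATE EXTERNAL TABLE raw_data (
      id     STRING,
      amount STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/path/to/hdfs/dir';   -- hypothetical location

    -- "Simulated DELETE": build the required table with CTAS,
    -- dropping the header rows in the WHERE clause and casting to real types.
    CREATE TABLE clean_data STORED AS PARQUET AS
    SELECT CAST(id AS BIGINT)     AS id,
           CAST(amount AS DOUBLE) AS amount
    FROM raw_data
    WHERE id != 'id';   -- header lines carry the column name as text

After that you can run your grouping and aggregation against clean_data and drop raw_data once you no longer need it.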

OTHER TIPS

A straightforward approach would be to run the HDFS data through Pig to filter out the headers and generate a new HDFS dataset formatted so that Impala could read it cleanly.

A more arcane approach would depend on the format of the HDFS data. For example, if both header and data lines are tab-delimited, then you could read everything using a schema with all STRING fields and then filter or partition out the headers before doing aggregations.
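Roughly, that could look like the following in Impala SQL (again only a sketch; the table name events_raw, its columns, the path, and the header text matched in the WHERE clause are made-up placeholders):

    -- External table with every field as STRING, so header lines load without errors.
    CREATE EXTERNAL TABLE events_raw (
      event_time STRING,
      metric     STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/path/to/hdfs/dir';   -- hypothetical location

    -- Filter the header lines out in the WHERE clause,
    -- then cast and aggregate the remaining rows directly.
    SELECT event_time,
           SUM(CAST(metric AS DOUBLE)) AS total_metric
    FROM events_raw
    WHERE metric != 'metric'   -- rows still holding the header text are skipped
    GROUP BY event_time;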

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow