Question


I have daily folders on HDFS containing a lot of tar.gz files, each holding a large number of text files.
Some of those tar.gz files turned out to be corrupted and cause Hive/MapReduce jobs to crash with an "unexpected end of stream" error when processing them.

I identified a few of them and tested them with tar -zxvf. They do indeed exit with an error, but they still extract a decent number of files before failing.

Is there a way to stop Hive/MapReduce jobs from simply crashing when a tar.gz file is corrupted? I've tested some error-skipping and failure-tolerance parameters such as
mapred.skip.attempts.to.start.skipping,
mapred.skip.map.max.skip.records,
mapred.skip.mode.enabled,
mapred.map.max.attempts,
mapred.max.map.failures.percent,
mapreduce.map.failures.maxpercent.

In a small number of cases this helped get a complete folder processed without crashing, but mostly it caused the job to hang and never finish at all.
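(For illustration only, not my exact setup: this is roughly how such tolerance knobs can be set programmatically on a job, using the Hadoop 2 property names; older releases use the mapred.* spellings listed above.)

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Sketch: failure-tolerance settings similar to the parameters listed above,
// applied through the MapReduce Java API (Hadoop 2 property names).
public class TolerantJobConfig {
    public static Job configure() throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("mapreduce.map.maxattempts", 4);         // retries per failing map task
        conf.setInt("mapreduce.map.failures.maxpercent", 5); // % of map tasks allowed to fail
        return Job.getInstance(conf, "tolerant-job");
    }
}
```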

Decompressing every single file outside Hadoop just to recompress it afterwards (to get clean gzip files) and then upload everything to HDFS again would be a painful process, because of the extra steps and the large volume of data this would generate.

Is there a cleaner / more elegant solution that someone has found?

Thanks for any help.


Solution

I'm super late to the party here, but I just faced this exact issue with corrupt gzip files. I ended up solving it by writing my own RecordReader which would catch IOExceptions, log the name of the file that had a problem, and then gracefully discard that file and move on to the next one.

I've written up some details (including code for the custom RecordReader) here: http://daynebatten.com/2016/03/dealing-with-corrupt-or-blank-files-in-hadoop/
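The full code is in the linked post; as a rough sketch of the idea (not the exact code from the write-up), such a reader can delegate to Hadoop's LineRecordReader and treat an IOException while reading as the end of the split:

```java
import java.io.IOException;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

/**
 * Wraps LineRecordReader: an IOException while reading a split is logged
 * (with the offending file name) and treated as end-of-input, so the rest
 * of the corrupt file is skipped instead of failing the task.
 */
public class SkipCorruptFileRecordReader extends RecordReader<LongWritable, Text> {

    private static final Log LOG = LogFactory.getLog(SkipCorruptFileRecordReader.class);

    private final LineRecordReader delegate = new LineRecordReader();
    private String fileName = "(unknown)";

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        if (split instanceof FileSplit) {
            fileName = ((FileSplit) split).getPath().toString();
        }
        delegate.initialize(split, context);
    }

    @Override
    public boolean nextKeyValue() throws InterruptedException {
        try {
            return delegate.nextKeyValue();
        } catch (IOException e) {
            // Corrupt gzip stream ("unexpected end of stream", bad CRC, ...):
            // log the file and report the split as finished.
            LOG.warn("Skipping corrupt input file: " + fileName, e);
            return false;
        }
    }

    @Override
    public LongWritable getCurrentKey() {
        return delegate.getCurrentKey();
    }

    @Override
    public Text getCurrentValue() {
        return delegate.getCurrentValue();
    }

    @Override
    public float getProgress() throws IOException {
        return delegate.getProgress();
    }

    @Override
    public void close() throws IOException {
        delegate.close();
    }
}
```

To wire it in, subclass TextInputFormat so that createRecordReader returns this reader, and point the job (or Hive table's input format) at that class.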

OTHER TIPS

I see essentially two ways out:

  1. You create a patch for Hadoop that allows this kind of handling of corrupted files, and then simply run your applications against them.
  2. You create a special Hadoop application that uses your own custom 'gunzip' implementation (one that can handle these kinds of problems). This application simply reads and rewrites the files as a map-only job (identity mapper), and its output is then used as input for your normal MapReduce/Pig/Hive/... jobs (see the sketch below).
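For option 2, here is a minimal sketch of what such a lenient 'gunzip' could look like (class and method names are just placeholders): it reads until the stream breaks and keeps whatever came out before the error, which the identity-style cleanup job could then write back out as clean files.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

/**
 * Placeholder helper: decompress a possibly truncated or corrupt .gz stream
 * and return whatever bytes were readable before the error occurred.
 */
public final class LenientGunzip {

    public static byte[] decompress(InputStream compressed) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        try (GZIPInputStream in = new GZIPInputStream(compressed)) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        } catch (IOException e) {
            // "Unexpected end of ZLIB input stream", bad CRC, etc.:
            // stop here and keep the bytes recovered so far.
        }
        return out.toByteArray();
    }
}
```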
Licensed under: CC-BY-SA with attribution