I'm super late to the party here, but I just faced this exact issue with corrupt gzip files. I solved it by writing my own RecordReader that catches IOExceptions, logs the name of the problematic file, and then gracefully discards that file and moves on to the next one.
I've written up some details (including code for the custom RecordReader) here: http://daynebatten.com/2016/03/dealing-with-corrupt-or-blank-files-in-hadoop/
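To illustrate the pattern without pulling in the Hadoop dependencies, here's a minimal standalone sketch of the same catch-log-skip idea using plain `java.util.zip`. The class and method names are hypothetical (they're not from the blog post or from Hadoop); in the real RecordReader the try/catch wraps the reads from the split instead:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;

public class SkipCorruptGzip {

    /**
     * Decompresses each gzip payload in turn. A corrupt payload raises an
     * IOException, which we catch so we can log the file's name and keep
     * going with the remaining inputs rather than failing everything.
     */
    public static List<String> readAll(List<String> names, List<byte[]> payloads) {
        List<String> contents = new ArrayList<>();
        for (int i = 0; i < names.size(); i++) {
            try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(payloads.get(i)))) {
                contents.add(new String(in.readAllBytes()));
            } catch (IOException e) {
                // Same idea as the custom RecordReader: record which file
                // was bad, then discard it and move on to the next one.
                System.err.println("Skipping corrupt gzip input: " + names.get(i));
            }
        }
        return contents;
    }
}
```

In the Hadoop version, the equivalent move is to have `nextKeyValue()` catch the IOException, log the split's file path, and return `false` so the framework treats the file as exhausted.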