Question

I have a folder of input bz2 files, and some of them might be corrupted. I want to remove all the corrupted or invalid bz2 files before running my MR job. What's a good way to do that?

Solution

Use bzip2 -t to test whether a bz2 file is corrupted. For an invalid file, I think you will see something like this:

bzip2: test1.txt: bad magic number (file not created by bzip2)
bzip2: 2: bad magic number (file not created by bzip2)

You can use the `bzip2recover' program to attempt to recover
data from undamaged sections of corrupted files.

So if your files are on your local file system, a shell script that runs bzip2 -t on each file and removes the ones that fail the test should work. If your files are already on HDFS, use Hadoop streaming with a mapper script that outputs the corrupted files, and either no reducer or a reducer that deletes or post-processes those files.
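
For the local file system case, here is a minimal sketch of the same check. It swaps in Python's standard bz2 module instead of shelling out to bzip2 -t, so its verdicts may differ from the command-line tool in edge cases. The function names is_valid_bz2 and find_corrupted are hypothetical, and it assumes the files end in .bz2 and sit directly in one folder.

```python
import bz2
import sys
from pathlib import Path


def is_valid_bz2(path, chunk_size=1 << 20):
    """Return True if the whole file decompresses without error."""
    try:
        with bz2.open(path, "rb") as fh:
            # Read to the end in chunks so large files never sit in memory.
            while fh.read(chunk_size):
                pass
        return True
    except (OSError, EOFError, ValueError):
        # Invalid data, a truncated stream, or an unreadable file.
        return False


def find_corrupted(folder):
    """Return the sorted list of *.bz2 files in `folder` that fail the check."""
    return sorted(p for p in Path(folder).glob("*.bz2") if not is_valid_bz2(p))


# Assumed usage: python check_bz2.py /path/to/folder
if __name__ == "__main__" and len(sys.argv) > 1:
    for bad in find_corrupted(sys.argv[1]):
        print(bad)
```

The script only lists the failing files. Once you have reviewed the list, you could delete each one by calling unlink() on its path before starting the job.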

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow