Question

I'm collecting logs with Flume and writing them to HDFS. In my test case the result is lots of small files (~300 kB), because the log-collection process is scaled for production volumes.

Is there an easy way to combine these small files into larger ones that are closer to the HDFS block size (64 MB)?


Solution

GNU coreutils split can do the job.

If the source data consist of lines - in my case they do - and one line is around 84 bytes, then a 64 MB HDFS block (67,108,864 bytes) holds roughly 800,000 of them (67,108,864 / 84 ≈ 800,000):

hadoop dfs -cat /sourcedir/* | split --lines=800000 - joined_
hadoop dfs -copyFromLocal ./joined_* /destdir/

or with the --line-bytes option, which caps each output file at roughly 64 MB without splitting lines:

hadoop dfs -cat /sourcedir/* | split --line-bytes=67108864 - joined_
hadoop dfs -copyFromLocal ./joined_* /destdir/
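
The 84-byte figure above was measured from my data. If the average line size isn't known up front, it can be estimated from one of the small files. This is a minimal sketch, assuming plain single-byte (ASCII) log lines; /sourcedir/sample.log is a placeholder for any one of them:

# print the average number of bytes per line (newline included) in a sample file
hadoop dfs -cat /sourcedir/sample.log | awk '{ bytes += length($0) + 1; lines++ } END { print bytes / lines }'

Dividing 67,108,864 by that average gives a --lines value that fills roughly one 64 MB block.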

OTHER TIPS

My current solution is to write a MapReduce job that effectively does nothing, while limiting the number of reducers. Each reducer writes one output file, so together they concatenate the small files into a handful of larger ones. You can also prepend the original file name to each line to show where it came from.
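
The "do nothing" job above would normally be custom Java, but roughly the same effect can be sketched with Hadoop Streaming, using cat as both mapper and reducer so the only real knob is the reducer count. This is an assumption-laden sketch, not the answerer's exact job: the streaming jar path and the mapred.reduce.tasks property name vary with the Hadoop version.

# Identity job: everything under /sourcedir is shuffled through 4 reducers,
# producing 4 part-* output files in /destdir (which must not exist yet).
# Note the shuffle sorts lines, so ordering within each input file is lost.
# The jar location below is an assumption for older Hadoop layouts.
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -D mapred.reduce.tasks=4 \
    -input /sourcedir \
    -output /destdir \
    -mapper cat \
    -reducer cat

To tag each line with its source file, as suggested above, the mapper would need to be a small script that reads the map_input_file value Streaming exposes to mapper processes through the environment (job properties with dots replaced by underscores) instead of plain cat.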

I'm still interested in hearing if there is a standard or proven best way of doing this that I am not aware of.

You should take a look at File Crusher, open-sourced by media6degrees. It might be a little outdated, but you can download the source and make your own changes and/or contribute. The JAR and source are at: http://www.jointhegrid.com/hadoop_filecrush/index.jsp

This is essentially a map-reduce technique for merging small files.

Licensed under: CC-BY-SA with attribution