Question

I have a project that requires me to process a lot (1,000-10,000) of big (100 MB to 500 MB) images. The processing I need can be done via ImageMagick, but I was hoping to actually do this processing on Amazon's Elastic MapReduce platform (which I believe runs on Hadoop).

All of the examples I have found deal with text-based inputs (I have found the Word Count sample a billion times). I cannot find anything about this kind of work with Hadoop: starting with a set of files, performing the same action on each file, and then writing out each result as its own file.

I am pretty sure this can be done with this platform, and should be able to be done using Bash; I don't think I need to go to the trouble of creating a whole Java app or something, but I could be wrong.

I'm not asking for someone to hand me code, but if anyone has sample code or links to tutorials dealing with similar issues, it would be much appreciated...

Solution

There are several problems with your task.

As you've seen, Hadoop does not natively process images. But you can export all the file names and paths into a text file and run a map function over it; calling ImageMagick on files that sit on local disk is then not a big deal.
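
For example, a minimal sketch of building such a list (the directory, file names, and HDFS paths are just placeholders):

# Build a list of image paths, one per line; /mnt/images and image_list.txt
# are hypothetical names.
find /mnt/images -type f \( -name '*.tif' -o -name '*.png' \) > image_list.txt

# Only the list (not the images themselves) needs to go into HDFS as job input.
$HADOOP_INSTALL/bin/hadoop fs -put image_list.txt /jobs/imagemagick/input/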

But how do you deal with data locality?

You can't run ImageMagick directly on files in HDFS (it is only reachable through the Java API, and the FUSE mount is not stable), and you can't predict the task scheduling. So, for example, a map task can be scheduled to a host where the image does not exist.

Sure, you could use just a single machine and a single task, but then you gain nothing; you would just add a bunch of overhead.

Also, there is a memory problem when you shell out from a Java task. I wrote a blog post about it [1].
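
If you do shell out from a map task anyway, keeping the child JVM heap modest leaves headroom for the forked process. A hedged sketch, assuming an old-API (0.20/1.x) cluster where this property name exists; the jar and class names are hypothetical:

# Keep the per-task JVM heap small so the fork()ed ImageMagick process still
# has headroom. -D is only picked up if the job's main class uses ToolRunner.
hadoop jar my-image-job.jar MyImageJob \
    -D mapred.child.java.opts=-Xmx512m \
    /jobs/imagemagick/input /jobs/imagemagick/output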

and should be able to be done using Bash

That is the next problem: you'd have to write at least the map task yourself. In Java you would use a ProcessBuilder to call ImageMagick with a specific path and operation.
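
If you would rather stay in Bash, Hadoop Streaming lets the map task be a plain shell script instead of a Java class with ProcessBuilder. A minimal sketch, assuming NLineInputFormat feeds one local image path per task; the output directory and the -resize operation are made-up examples:

#!/usr/bin/env bash
# Minimal streaming-mapper sketch: NLineInputFormat hands us "offset<TAB>path".
# /mnt/output and the -resize 50% operation are illustrative placeholders.
while read -r offset path; do
    echo "reporter:status:Converting $path" >&2
    convert "$path" -resize 50% "/mnt/output/$(basename "$path")"
done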

I cannot find anything about this kind of work with Hadoop: starting with a set of files, performing the same action to each of the files, and then writing out the new file's output as it's own file.

Guess why? :D Hadoop is not the right thing for this task.

So basically I would recommend manually splitting your images across multiple EC2 hosts and running a bash script over them. It is less hassle and faster. To parallelize on the same host, split your files into one batch per core and run the bash script over each batch, as sketched below. This should utilize your machines quite well, and better than Hadoop ever could.
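
A minimal sketch of that per-core split, using file lists rather than physical folders (the directories and the ImageMagick -resize operation are placeholders):

#!/usr/bin/env bash
# Sketch: one worker per CPU core. /mnt/images, /mnt/output and -resize 50%
# are hypothetical; swap in your own processing step.
cores=$(nproc)
mkdir -p /mnt/output
find /mnt/images -type f > all_images.txt

# Split the list into one chunk per core without breaking lines.
split -n l/"$cores" all_images.txt chunk_

# Run one background worker per chunk, then wait for all of them.
for list in chunk_*; do
    (
        while read -r img; do
            convert "$img" -resize 50% "/mnt/output/$(basename "$img")"
        done < "$list"
    ) &
done
wait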

[1] http://codingwiththomas.blogspot.com/2011/07/dealing-with-outofmemoryerror-in-hadoop.html

OTHER TIPS

I would suggest looking at the example in "Hadoop: The Definitive Guide", 3rd edition. Appendix C outlines a way, in bash, to get a file from HDFS, unzip it, create a folder, create a new file from the files in the unzipped folder, and then put that file into another HDFS location.

I customized this script myself so that the initial hadoop fs -get is replaced by a curl call to a web server hosting the input files I need - I didn't want to put all the files into HDFS. If your files are already in HDFS, you can use the commented-out line instead. Either the HDFS get or the curl ensures the file is available locally for the task. There's a lot of network overhead in this.

There's no need for a reduce task.

The input file is a list of the URLs of the files to convert/download.

#!/usr/bin/env bash

# NLineInputFormat gives a single line: key is offset, value is Isotropic Url
read offset isofile

# Retrieve file from Isotropic server to local disk
echo "reporter:status:Retrieving $isofile" >&2
target=`echo $isofile | awk '{split($0,a,"/");print a[5] a[6]}'`
filename=$target.tar.bz2
#$HADOOP_INSTALL/bin/hadoop fs -get $isofile ./$filename
curl  $isofile -o $filename

# Un-bzip and un-tar the local file
mkdir -p $target
echo "reporter:status:Un-tarring $filename to $target" >&2
tar jxf $filename -C $target

# Take the file and do what you want with it. 
echo "reporter:status:Converting $target" >&2
convert .... $target/$filename $target.all

# Put gzipped version into HDFS
echo "reporter:status:Gzipping $target and putting in HDFS" >&2
gzip -c $target.all | $HADOOP_INSTALL/bin/hadoop fs -put - gz/$target.gz
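
To run a mapper script like this as a map-only Hadoop Streaming job with one URL per task, the submission looks roughly like the following. The streaming jar location, input/output paths, and script name are assumptions, and exact option and property spellings vary between Hadoop versions:

# Hedged sketch of submitting the script above via Hadoop Streaming.
$HADOOP_INSTALL/bin/hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-streaming-*.jar \
    -D mapred.reduce.tasks=0 \
    -D mapred.line.input.format.linespermap=1 \
    -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
    -input urls.txt \
    -output processed_out \
    -mapper convert_images.sh \
    -file convert_images.sh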

The New York Times processed 4 TB of raw image data into PDFs in 24 hours using Hadoop. It sounds like they took a similar approach: http://open.blogs.nytimes.com/2007/11/01/self-service-prorated-super-computing-fun/?scp=1&sq=self%20service%20prorated&st=cse. They used the Java API, but the rest is the same: get the file locally, process it, and then put it back into HDFS/S3.

You can take a look at CombineFileInputFormat in Hadoop, which can implicitly pack multiple files into a single split based on file size and locality.

But I'm not sure how you are going to process the 100 MB-500 MB images, as each one is quite big and in fact larger than Hadoop's default split size. Maybe you can try a different approach and split one image into several parts.

Anyway, good luck.

I've been looking for solutions for dealing with large-scale remote sensing images in Hadoop for a long time, and I have found nothing satisfying so far!

Here is an open-source project for splitting large-scale images into smaller ones in Hadoop. I have read the code carefully and tested it, but I found that the performance is not as good as expected. Anyway, it may be helpful and shed some light on the problem.

Project Matsu: http://www.cloudbook.net/directories/research-clouds/research-project.php?id=100057

Good luck!

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow