If I store all my images in a SequenceFile, how can I design a mapper to process only a selection of them?

StackOverflow https://stackoverflow.com//questions/22057078

  •  22-12-2019

Question

I have lots of image files that I need to store in HDFS. To avoid the small files problem, I am planning to store them in SequenceFiles.

My problem is that I need to create a MapReduce program that processes only a selection of those files. I don't think it is a good idea to read the content of every image from the SequenceFile if I am only planning to process a few of them. Also, more images can be added later: if I create a new SequenceFile for each batch of images, how would I know which SequenceFile contains the images I need to process? Even if I knew, it would be overwhelming to filter the images manually before passing them as input to the mapper.

Please advise. Thanks!


Solution

If you can store your files in a MapFile, which is a SequenceFile with an index, you can use MapFile.Reader to look up a single file by its key. For example, with the file name as the key:

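import java.io.IOException;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

// fs, dirName and conf are assumed to be initialized elsewhere
MapFile.Reader reader = new MapFile.Reader(fs, dirName, conf);

// Returns the bytes stored under the given file name, or null if there is no such key
public byte[] get(String filename) throws IOException {
    Text key = new Text(filename);
    BytesWritable value = new BytesWritable();
    // MapFile.Reader.get fills value and returns it, or returns null when the key is missing
    if (reader.get(key, value) != null) {
        return value.copyBytes();
    }
    return null;
}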
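To give an idea of why such a lookup does not have to read every image, here is a rough in-memory model of the idea behind a MapFile: records are kept sorted by key, and a sparse index remembers the position of every N-th key. A lookup binary-searches the index and then scans at most N records. This is only a sketch using plain Java collections; the class SparseIndexSketch, its fields, and the interval of 2 are made up for illustration and are not part of Hadoop. A real MapFile keeps its data and index as files on HDFS and, as far as I know, indexes every 128th key by default.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-memory model of a MapFile: sorted records plus a sparse key index. */
final class SparseIndexSketch {
    private final String[] keys;     // stands in for the keys of the sorted data file
    private final byte[][] values;   // stands in for the image bytes
    private final List<Integer> indexPositions = new ArrayList<>(); // positions of indexed keys
    private final int interval;      // one index entry per `interval` records
    int recordsScanned;              // records touched by the most recent get()

    SparseIndexSketch(String[] sortedKeys, byte[][] values, int interval) {
        if (interval < 1 || sortedKeys.length != values.length) {
            throw new IllegalArgumentException("bad interval or mismatched arrays");
        }
        // Like a MapFile, this model only works if keys arrive in sorted order.
        for (int i = 1; i < sortedKeys.length; i++) {
            if (sortedKeys[i - 1].compareTo(sortedKeys[i]) > 0) {
                throw new IllegalArgumentException("key out of order: " + sortedKeys[i]);
            }
        }
        this.keys = sortedKeys;
        this.values = values;
        this.interval = interval;
        for (int i = 0; i < sortedKeys.length; i += interval) {
            indexPositions.add(i);
        }
    }

    /** Returns a copy of the value stored under key, or null if the key is absent. */
    byte[] get(String key) {
        recordsScanned = 0;
        // Binary search the sparse index for the last indexed key <= the wanted key.
        int lo = 0, hi = indexPositions.size() - 1, start = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int pos = indexPositions.get(mid);
            if (keys[pos].compareTo(key) <= 0) {
                start = pos;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        if (start < 0) {
            return null; // the key sorts before the first record
        }
        // Scan forward from the indexed position, at most `interval` records.
        int end = Math.min(start + interval, keys.length);
        for (int i = start; i < end; i++) {
            recordsScanned++;
            int cmp = keys[i].compareTo(key);
            if (cmp == 0) {
                return values[i].clone();
            }
            if (cmp > 0) {
                break; // passed the spot where the key would have been
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String[] names = {"a.png", "b.png", "c.png", "d.png", "e.png"};
        byte[][] data = new byte[names.length][];
        for (int i = 0; i < names.length; i++) {
            data[i] = new byte[] {(byte) i};
        }
        SparseIndexSketch files = new SparseIndexSketch(names, data, 2);
        byte[] found = files.get("d.png");
        System.out.println("found=" + (found != null) + ", recordsScanned=" + files.recordsScanned);
    }
}
```

The number of records touched per lookup is bounded by the index interval rather than by the total number of images, which is the property that makes picking a few files out of a large container cheap.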

If your files are generated by a MapReduce application, you can use MapFileOutputFormat to write the output as MapFiles.

In addition, since you only need to process a few files, I don't think you need MapReduce for this at all.

OTHER TIPS

You could store the image files in HBase along with any other attributes of the images that you may want to filter or query on. This will allow you to query for images selectively.

See this:
http://apache-hbase.679495.n3.nabble.com/Storing-images-in-Hbase-td4036184.html
http://www.slideshare.net/jacque74/hug-hbase-presentation

Licensed under: CC-BY-SA with attribution