Question

Can we use Hadoop to run SIFT on multiple images?

SIFT takes ~1 s per image to extract keypoints and their descriptors. Given that each run is independent of the others and the runtime of a single run cannot be reduced, can we reduce the overall runtime anyhow?

Multithreading reduces runtime by a factor of the number of processor cores you have: we can process one image per core.
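For concreteness, a minimal sketch of the multithreaded approach, using a fixed thread pool sized to the core count; `extractSift` is a hypothetical stand-in for the actual per-image SIFT call:

```java
import java.io.File;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelSift {
    // Hypothetical per-image SIFT call; substitute your real extractor.
    static void extractSift(File image) {
        // ~1 s of CPU-bound work per image
    }

    public static void main(String[] args) throws InterruptedException {
        File[] images = new File(args[0]).listFiles();
        // One worker thread per available core
        ExecutorService pool =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        for (File image : images) {
            pool.submit(() -> extractSift(image));
        }
        pool.shutdown();                                   // no more tasks accepted
        pool.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS);
    }
}
```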

Can Hadoop be used anyhow to parallelize the run over multiple images?
If yes, by what factor can it reduce runtime, supposing we have a 3-node cluster?


Solution

Yes, Hadoop can be used to extract SIFT descriptors from multiple images; for example, SIFT extraction can be run on Hadoop using OpenIMAJ.
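A minimal sketch of what such a job's mapper might look like, assuming the images have already been packed into a SequenceFile keyed by filename (see problem 2 below). The OpenIMAJ classes (`DoGSIFTEngine`, `ImageUtilities`, `Keypoint`) are real; the job wiring and output format here are illustrative:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.openimaj.feature.local.list.LocalFeatureList;
import org.openimaj.image.FImage;
import org.openimaj.image.ImageUtilities;
import org.openimaj.image.feature.local.engine.DoGSIFTEngine;
import org.openimaj.image.feature.local.keypoints.Keypoint;

/** Mapper: one input record = one image (filename -> raw bytes). */
public class SiftMapper extends Mapper<Text, BytesWritable, Text, Text> {
    private final DoGSIFTEngine engine = new DoGSIFTEngine();

    @Override
    protected void map(Text filename, BytesWritable imageBytes, Context context)
            throws IOException, InterruptedException {
        // Decode the raw bytes into a greyscale FImage
        FImage image = ImageUtilities.readF(
                new ByteArrayInputStream(imageBytes.getBytes(), 0, imageBytes.getLength()));
        // Extract difference-of-Gaussian SIFT keypoints and descriptors
        LocalFeatureList<Keypoint> keypoints = engine.findFeatures(image);
        // Emit filename -> keypoint count (a real job would write the descriptors)
        context.write(filename, new Text(Integer.toString(keypoints.size())));
    }
}
```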

Hadoop will process the images in parallel across all cluster nodes, but the potential speedup depends on the size of the image dataset: if the dataset is small, the runtime may actually increase due to Hadoop's job-scheduling and data-movement overhead.

You may run into two problems:

  1. Copying images to HDFS can be slow. It may be faster to process all the images on one computer than to copy them to HDFS and process them on a 3-node cluster; it depends on the size of the dataset and the number of nodes in the cluster.

  2. A typical image is small compared to the HDFS block size (64 MB by default), and Hadoop handles many small files badly (see the Cloudera blog post on the small-files problem). You can use Hadoop sequence files to combine many small image files into one large file. OpenIMAJ includes a SequenceFileTool that can be used for this purpose; a plain Hadoop sketch of the same idea follows this list.
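As an illustration of the sequence-file idea, here is a minimal sketch that packs a local directory of images into one SequenceFile (filename as key, raw bytes as value) using the stock Hadoop 2.x writer API; in practice OpenIMAJ's SequenceFileTool does this for you, and the argument layout below is just an assumption for the example:

```java
import java.io.File;
import java.nio.file.Files;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

/** Packs every file in a local directory into one Hadoop SequenceFile. */
public class ImagePacker {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(new Path(args[1])),     // output, e.g. images.seq
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            for (File image : new File(args[0]).listFiles()) {   // local image directory
                byte[] bytes = Files.readAllBytes(image.toPath());
                // One record per image: filename -> raw image bytes
                writer.append(new Text(image.getName()), new BytesWritable(bytes));
            }
        }
    }
}
```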

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow