Question

I'm looking to create an ID system for cataloging images. I can't use md5() since that will change if I alter the EXIF tags of the image.

I am currently using the SHA1 checksum computed by ImageMagick. It works perfectly, but it's really, really slow on larger images (~15 seconds on a quad-core Xeon for a 21-megapixel JPEG).

Are there any other "visual" methods of uniquely identifying an image that would be faster?


Solution

You could try running MD5 on the actual bitmap data instead of the JPEG file. I tested on my machine (also a quad-core Xeon) and the following runs in about 900 ms on a 23-megapixel image.

#include <stdint.h>
#include <stdlib.h>
#include <wand/MagickWand.h>   /* MagickWand/MagickWand.h on ImageMagick 7 */
#include <openssl/md5.h>

size_t width  = MagickGetImageWidth(imageWand);
size_t height = MagickGetImageHeight(imageWand);

/* Decode to a raw 8-bit RGB buffer, 3 bytes per pixel. */
uint8_t *imageData = malloc(width * height * 3);

MagickExportImagePixels(imageWand,
    0, 0, width, height, "RGB", CharPixel, imageData);

/* Hash only the pixels, so EXIF edits never change the digest. */
unsigned char *imageDigest = MD5(imageData, width * height * 3, NULL);

free(imageData);
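
To use the digest as a catalog ID you would normally render those 16 raw bytes as hex. A minimal sketch (the helper name digest_to_hex is my own, not part of OpenSSL):

#include <stdio.h>
#include <openssl/md5.h>

/* Hypothetical helper: format the 16-byte MD5 digest as a 32-char hex ID. */
void digest_to_hex(const unsigned char digest[MD5_DIGEST_LENGTH],
                   char out[2 * MD5_DIGEST_LENGTH + 1])
{
    int i;
    for (i = 0; i < MD5_DIGEST_LENGTH; i++)
        sprintf(out + 2 * i, "%02x", digest[i]);
    out[2 * MD5_DIGEST_LENGTH] = '\0';
}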

OTHER TIPS

What do you mean by "visual checksum"? The algorithms you mention (MD5/SHA/CRC) work in a byte-based manner and take no account of the visual information in the image. If you convert one of your images to another format, the two files will show the same picture but have totally different MD5/SHA/CRC checksums.

If your only worry is EXIF edits, you could make a temporary copy of the image, strip all metadata from it with the exiv2 library, and run the checksum algorithm then. I suppose this is much faster than manually scaling down the images. You could also speed up the calculation by using just the first n kilobytes of the (stripped) source file for the checksum.
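
A minimal sketch of the first-n-kilobytes idea, assuming the metadata has already been stripped from the file (the 64 KB prefix size and the helper name hash_prefix are arbitrary choices of mine):

#include <stdio.h>
#include <openssl/md5.h>

/* Hash only the first 64 KB of a (metadata-stripped) image file. */
int hash_prefix(const char *path, unsigned char digest[MD5_DIGEST_LENGTH])
{
    unsigned char buf[64 * 1024];
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    size_t n = fread(buf, 1, sizeof buf, f);
    fclose(f);
    MD5(buf, n, digest);
    return 0;
}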

If all your image files come directly from a camera, you are even better off: you could extract the pre-generated EXIF thumbnail with exiv2 (usually just a few kilobytes) and calculate its checksum.

About the scale-down approach: also be aware that ImageMagick might change its scaling algorithms in the future, which would invalidate your checksums (the byte structure of the scaled-down versions would change).

As noted by Todd Yandell, MD5 is probably fast enough. If not, you can get something even faster by using a 32-bit or 64-bit CRC for your checksum. The major difference is that anybody can make up a new image with the same CRC; it is very easy to spoof. It is quite hard for someone to spoof an MD5 checksum. A minor difference is that the CRC has many fewer bits, but unless you have a very large number of images, a collision is still unlikely.
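
As an illustration, a 32-bit CRC over the same raw pixel buffer can be computed with zlib's crc32() (a sketch; the wrapper name pixel_crc32 is mine):

#include <zlib.h>

/* 32-bit CRC of the raw pixel buffer: faster than MD5, but easy to spoof. */
unsigned long pixel_crc32(const unsigned char *data, unsigned int len)
{
    unsigned long crc = crc32(0L, Z_NULL, 0);   /* standard zlib seed */
    return crc32(crc, data, len);
}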

exiftool claims to be able to extract the binary image from a JPEG file, so that you can compute your checksum without decompressing, but I can't figure out from the man page how to do it.

I did some experiments on a laptop with an Intel Core 2 Duo L7100 CPU: an 8-megapixel JPEG takes about 1 second to decompress to PPM format, then another 1 second to checksum. Checksum times were not dramatically different between md5sum, sum, and sha1sum. So your best bet might be to find a way to extract the binary data without decompressing it.

I also note that your checksum is going to be almost as good even if it uses far fewer pixels. Compare these two:

djpeg -scale 1/8 big.jpg | /usr/bin/sha1sum   # 0.70s
djpeg            big.jpg | /usr/bin/sha1sum   # 2.15s

You should consider that someone may crop the image or modify the palette, color depth, or anything else; a flat checksum will then be different even though the original and the modified image still look pretty much alike visually. Perhaps there is an effective algorithm for cropped or re-colored images, like the one Google Images uses to search for similar pictures.
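
One common building block in that direction is a perceptual "average hash": shrink the image to 8x8 grayscale and record which pixels are brighter than the mean. A rough sketch with MagickWand (my own illustration, not Google's algorithm; it survives re-encoding and mild recoloring, but not cropping):

#include <stdint.h>
#include <wand/MagickWand.h>

/* 64-bit average hash: one bit per pixel of an 8x8 grayscale thumbnail.
   Note this modifies the wand; hash a clone if you still need the image. */
uint64_t average_hash(MagickWand *wand)
{
    uint8_t gray[64];
    unsigned int sum = 0, i;
    uint64_t hash = 0;

    MagickScaleImage(wand, 8, 8);                   /* shrink to 8x8     */
    MagickSetImageColorspace(wand, GRAYColorspace);
    MagickExportImagePixels(wand, 0, 0, 8, 8, "I", CharPixel, gray);

    for (i = 0; i < 64; i++) sum += gray[i];
    for (i = 0; i < 64; i++)                        /* pixel above mean? */
        if (gray[i] * 64u > sum) hash |= (uint64_t)1 << i;
    return hash;
}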

Licensed under: CC-BY-SA with attribution