Question

Are there end-of-exif / end-of-xmp / end-of-iptc / start-of-data markers that I could use to get a checksum of just the data part of a jpg / jpeg (and other image formats)?


Solution 6

MediaTags has checksum support for JPEG, MP3, M4A, etc.

OTHER TIPS

I think this question is related to "Compute hash of only the core image data (excluding metadata) for an image"; https://stackoverflow.com/a/10075170/890106 gives part of an answer if you're looking for code.

It might not work with all JPG variants, though: some of them can embed multiple images (MPF / CIPA Multi-Picture Format; more information at http://www.sno.phy.queensu.ca/~phil/exiftool/TagNames/MPF.html), and you might still have some metadata left. Also, some software puts a UID of the form --[0-9A-F]+-- at the end of the file, and it shouldn't be read. The safest solution is probably to checksum the pixels (though orientation, color profile, etc. can still influence the result).

One easy way to get a hash of just the pixel data would be to convert the JPEG into a 32-bit BMP, or alternatively into a PNG, and calculate a hash from that. This strips all the associated information from the JPEGs and would even match JPEGs with different encodings that lead to the same pixel data. You could of course also use the in-memory pixel data of the resulting BMPs directly if you have it (e.g. Windows has several API functions to get it from any supported image type).
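A minimal sketch of the pixel-hashing idea, assuming Python with the third-party Pillow library for decoding (not something the answer above mentions). The names hash_pixels and hash_image_file are made up for illustration. The image is converted to RGBA so that every input ends up in the same 32-bit layout, and the mode and dimensions are mixed into the hash so that a 2x1 and a 1x2 image with the same bytes do not collide.

```python
import hashlib


def hash_pixels(width: int, height: int, mode: str, raw: bytes) -> str:
    """Hash decoded pixel bytes together with the mode and dimensions."""
    h = hashlib.sha256()
    h.update(f"{mode}:{width}x{height}:".encode("ascii"))
    h.update(raw)
    return h.hexdigest()


def hash_image_file(path: str) -> str:
    """Decode an image file and hash only its pixels (hypothetical helper)."""
    # Assumption: Pillow is installed; it is imported here so that
    # hash_pixels stays usable without it.
    from PIL import Image

    with Image.open(path) as img:
        rgba = img.convert("RGBA")  # normalise to 32 bits per pixel
        return hash_pixels(rgba.width, rgba.height, rgba.mode, rgba.tobytes())
```

Note that JPEG decoders are not guaranteed to produce bit-identical pixels, so hashes made this way are probably only comparable when the same decoder is used throughout.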

Yes for JPEG and EXIF; I don't know about the others.

The JPEG spec that I have is called JFIF (JPEG File Interchange Format). It comes from Annex B of ISO 10918-1 and, like all ISO specs, it takes careful reading to figure out how to translate the spec into data structures. I think this is much easier to follow.

The EXIF format parses much like the TIFF format. Each directory entry has a tag, a type and a count, so you just walk the entries until you get to the one that describes the image data. It has a pointer to the image data (actually pointers to strips, but I'm pretty sure that you can assume that everything from the first strip of image data to the end of the file is image data).
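The walk described above can be sketched roughly as follows, assuming a plain TIFF-structured byte string: a byte-order mark ("II" or "MM"), the magic number 42, and an offset to the first directory of 12-byte entries. The function names read_ifd0 and first_strip_offset are made up for illustration, only the first directory is read, and only the StripOffsets tag (0x0111) is decoded.

```python
import struct

STRIP_OFFSETS = 0x0111  # TIFF tag 273: where the image strips start


def read_ifd0(tiff: bytes):
    """Return (endian, {tag: (type, count, raw_value)}) for the first IFD."""
    if tiff[:2] == b"II":
        endian = "<"  # little-endian ("Intel")
    elif tiff[:2] == b"MM":
        endian = ">"  # big-endian ("Motorola")
    else:
        raise ValueError("not a TIFF header")
    magic, ifd_offset = struct.unpack(endian + "HI", tiff[2:8])
    if magic != 42:
        raise ValueError("bad TIFF magic number")
    (count,) = struct.unpack(endian + "H", tiff[ifd_offset:ifd_offset + 2])
    entries = {}
    for i in range(count):
        start = ifd_offset + 2 + 12 * i
        # Each entry: tag, type, value count, then 4 bytes of value or offset.
        tag, typ, n, raw = struct.unpack(endian + "HHI4s", tiff[start:start + 12])
        entries[tag] = (typ, n, raw)
    return endian, entries


def first_strip_offset(tiff: bytes):
    """Offset of the first image strip, or None if the tag is absent."""
    endian, entries = read_ifd0(tiff)
    if STRIP_OFFSETS not in entries:
        return None
    typ, n, raw = entries[STRIP_OFFSETS]
    fmt = {3: "H", 4: "I"}.get(typ)  # SHORT or LONG
    if fmt is None:
        raise ValueError("unexpected type for StripOffsets")
    size = struct.calcsize(endian + fmt)
    if n * size <= 4:
        # The values fit inside the entry itself.
        return struct.unpack(endian + fmt, raw[:size])[0]
    # Otherwise the 4 bytes are an offset to the array of strip offsets.
    (array_offset,) = struct.unpack(endian + "I", raw)
    return struct.unpack(endian + fmt, tiff[array_offset:array_offset + size])[0]
```

Under the answer's assumption, the bytes to checksum would then be tiff[first_strip_offset(tiff):]; whether that assumption holds for a given file is not something this sketch checks.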

The EXIF format has its own website.

You'll have to look at each format. For JPEG, it looks like the structure implies that you can find the metadata by looking for the segments that start with FFEn (e.g. 0xFFE1). The length follows the marker as 2 bytes in big-endian format (and counts those 2 length bytes too), so you can step over each such segment and checksum the rest. For more details, see here.
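A minimal sketch of walking the JPEG markers this way, assuming the goal is to leave the FFEn (APPn) segments, and also comment segments (FFFE), out of the checksum. The function name jpeg_data_digest is made up for illustration. It assumes a well-formed file with no fill bytes between segments, and it hashes everything from the start-of-scan marker (FFDA) to the end of the input, so any trailing bytes after the end-of-image marker are included.

```python
import hashlib
import struct


def jpeg_data_digest(data: bytes) -> str:
    """SHA-256 of a JPEG byte string with APPn and COM segments skipped."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG: missing SOI marker")
    h = hashlib.sha256()
    pos = 2  # just past SOI
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected a marker at offset {pos}")
        marker = data[pos + 1]
        if marker == 0xDA:
            # Start of scan: compressed image data follows, hash to the end.
            h.update(data[pos:])
            break
        # The 2-byte big-endian length counts itself but not the marker.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        end = pos + 2 + length
        is_metadata = 0xE0 <= marker <= 0xEF or marker == 0xFE
        if not is_metadata:
            h.update(data[pos:end])  # keep tables, frame header, etc.
        pos = end
    return h.hexdigest()
```

Typical use would be something like jpeg_data_digest(open("photo.jpg", "rb").read()), where photo.jpg stands for any file you want to compare. Two files that differ only in their EXIF, XMP or comment segments should then give the same digest.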

Since you want to do this for various image formats, you should just use a general-purpose image decompression library and run your checksum on the uncompressed data. This will allow you to match identical images even if they are encoded differently on disk.

If you want to limit yourself to JPEG, you can checksum the data between SOI and EOI. This answer can be slightly adapted to do what you need.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow