Question

I'm putting together a script to find remove duplicates in a large library of images. At the moment I'm doing a two pass filter of first finding files of the same size and then doing a sha256 on a 10240 byte piece of the file to get a fingerprint of the files with the same size (code here).

It works well, but I'm guessing there are probably checksums built in to the jpeg format that I could use instead of doing the sha256.

Does anyone know if there are checksums or other components that could act as checksums / fingerprints? If so, is there an efficient way to access them?

Was it helpful?

Solution

I don't think the JPEG specification includes any kind of checksum in the way you're describing.

A JPEG can contain a thumbnail as part of its EXIF metadata, though. It's not a perfect indicator, since it's possible for two different images to have the same thumbnail. There's at least one documented case of a thumbnail not being replaced after the image had undergone substantial modifications, said thumbnail revealing much more than the publisher had intended.

OTHER TIPS

Its been awhile since I've dug into the IJG library, but I don't think there's an easy class member or function call you can use there to check for some type of fingerprint. You could use the built in EXIF tags if you can control the encoding of the images...

I'm just built a very similar script. I don't want to checksum metadata I want to see if the actual images are duplicates even if tags have been modified. Best for that is not to sort by size, but do sort by the checksum istelf. I use jhead to remove metadata and then checksum the whole file (but I also thought about just doing part of it, but actually I don't think it saves much time). jhead doesn't use shared memory (pipes) and does overwrite so I just copy the file to shared memory first. I place the checksum in the ImageDescription field for later faster retrieval. Obviously this also allows to check image integrity later and is part of why I checksum the whole thing. Tip: exiv2 is MUCH faster for reading and writing the metadata than exiftool for one at a time decision based manipulation.

In JPEG standard(ITU-T.81) i believe there isn't any field/syntax element which has a checksum or such, for the whole compressed jpeg image file. Unless a customised application puts such filed in the Application segment, or as meta data for which segments are provided in the standard. So to serve your purpose, what you are doing is one soln. Other could be some kind a application wrapper which will call some binary file compare utlitiy (like beyond compare, or even a windows command fc /b) and check the result of that compare utility and take the decision u want to.

-AD

One way you could perform is reduce all images to a fixed size and store that as a thumbnail. Then the image comparison would compare similar sized images and give you a chance of being a duplicate - useful if you have cropped (unless cropped heavily) or resized images and want to find those 'duplicates'.

In the XMP specification there are document ID and version ID which should uniquely identify the version of the image.

The problem with these (and with any other metadata-based identification method) is that it might not be respected by some applications that can change the content of the jpeg updating the metadata accordingly.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top