Software chain to find duplicate images

https://stackoverflow.com/questions/20802211

22-09-2022

Question

What I'm trying to achieve

I'm looking for a software chain to find duplicate images. First, here's how I define a duplicate image: there's an original image, coming directly from a camera, and modified version(s) of this image. A modification can be any one, or a combination, of the following operations:

Changing brightness, contrast, coloring (a modified version of the image could be in Black & White)
Cropping
Resizing
Rotating
Adding a frame around the image
Writing on the frame

A real world example:

The original image

Modified version #1: luminosity + brightness change + resize

Modified version #2: cropping

Modified version #3: frame + text

Matching any pair of the images above should result in finding a duplicate. As you can see, the modifications are not intended to be destructive, but rather ameliorative. For instance, the main subject of the image (here, the alarm clock) will never be cropped through its middle.

Modifications can be chained (a new modification can be based on a previous modification rather than on the original image), so an image may have been recompressed many times.

Then, the photographer can take another image:

A brand new image

The viewpoint and the main subject have changed (it's now 0:02!) => when compared to any of the images above, this new image should not be considered as a duplicate.

What I was doing so far

#1: getting rid of frames

First of all, I'm using OpenCV's Canny Detector + Hough algorithm to find vertical and horizontal lines on the image. Then, I crop the picture according to the lines the algorithm found.

The problem I've been facing with this solution: when there are horizontal or vertical lines in the original picture's background, it's hard to tell which lines come from the frame and which come from the picture itself, so these cases end up in manual review.

I've also set a higher threshold to avoid getting too many false positives. Unfortunately, some elaborate frames (with a gradient, for instance) slip through.

Is there a better algorithm to detect these frames?

#2: finding duplicates

I've been using pHash and its DCT image hash so far. It computes a visual hash, and provides a very efficient way to search for similar images in a large database.
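For readers unfamiliar with DCT hashes, the idea can be sketched in pure Python as below. This is only an illustration in the spirit of pHash, not pHash's actual implementation: the function names, the 32x32 input size and the 8x8 low-frequency block are assumptions, and the image is assumed to be already converted to a small square grayscale grid.

```python
import math


def dct_1d(values):
    """Unnormalised DCT-II of a sequence of numbers."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * (i + 0.5) * k / n) for i, v in enumerate(values))
        for k in range(n)
    ]


def dct_2d(block):
    """2D DCT: transform every row, then every column."""
    rows = [dct_1d(row) for row in block]
    cols = [dct_1d(col) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]  # transpose back to [row][col]


def dct_hash(pixels, hash_size=8):
    """Hash a square grayscale grid (e.g. 32x32 lists of intensities).

    Keeps the hash_size x hash_size lowest frequencies and emits one bit
    per coefficient: 1 if it is above the median of the AC terms.
    """
    coeffs = dct_2d(pixels)
    low = [coeffs[r][c] for r in range(hash_size) for c in range(hash_size)]
    ac_terms = low[1:]  # skip the DC term, which only encodes overall brightness
    median = sorted(ac_terms)[len(ac_terms) // 2]
    bits = 0
    for value in low:
        bits = (bits << 1) | int(value > median)
    return bits


def hamming(hash_a, hash_b):
    """Number of differing bits; small distances suggest similar images."""
    return bin(hash_a ^ hash_b).count("1")
```

Two images are then reported as candidates when the Hamming distance between their hashes falls under a chosen threshold. Because only coarse frequency content is compared, a uniform brightness shift leaves the hash essentially unchanged, while heavy crops or added frames can move many bits at once.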

Advantages:

It's very fast
You can search through thousands of images
It works well enough with all of my criteria (cropping, resizing, re-compressed images, rotation)

Disadvantages:

Many false positives
It reports duplicates for images that were taken from completely different viewpoints
It can miss some duplicates when images went through a combination of modifications

All of the duplicates pHash finds end up in manual review as well. That's not a problem, except when the input data is thousands of images of the same subject. The number of candidate pairs to review then grows quadratically, which is not very convenient.
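To put a number on the quadratic growth: in the worst case, where every image of the same subject matches every other one, n images produce n(n-1)/2 pairs. The helper name below is hypothetical.

```python
def pairs_to_review(image_count):
    """Worst-case number of image pairs when every image matches every other."""
    return image_count * (image_count - 1) // 2


for count in (100, 1000, 5000):
    print(count, "images ->", pairs_to_review(count), "pairs")
```

So 1000 images of the same subject can already mean close to half a million pairs in manual review.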

Ideas on how to improve the duplicate detection

I've been digging around on how to reduce the number of false positives from pHash. My first idea was adding OpenCV's template matching to my existing software chain. Problem: it wouldn't work for rotated images.

Then, I learned about feature detection, and I thought this might be the way to go. However, this is a vast field and this is where I need help.

On page 81 of this PDF I found an interesting comparison of feature detectors. If I get it right, I need "Rotation invariant" and "Scale invariant" but not "Affine invariant" (which seems to correspond to a change in the viewpoint). This would give me the following options:

Harris-Laplace
Hessian-Laplace
DoG
SURF

Would these algorithms answer my needs? Should I integrate them into my existing chain, or should I start a new chain from scratch? Going from feature detection to duplicate matching seems like a long way; what would be the best approach?

Solution

You should take the local feature matching approach (SURF/ORB/BRISK...). You can find a nice tutorial here: http://docs.opencv.org/doc/tutorials/features2d/feature_flann_matcher/feature_flann_matcher.html If efficiency is very important, you can replace OpenCV's findHomography with custom find-rigid-transform code, but if it is not a big issue, findHomography will probably serve you well.
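A minimal sketch of what this could look like, under stated assumptions: ORB is picked among the suggested detectors, matches are filtered with a ratio test, and the "find-rigid-transform" step is interpreted as a RANSAC fit of a similarity transform (rotation, uniform scale, translation), since the question includes resizing. All function names and the thresholds (ratio, pixel tolerance, min_inliers) are illustrative, not taken from the answer. Points are stored as complex numbers so that the transform is simply dst = a * src + b.

```python
import random


def fit_similarity(src, dst):
    """Least-squares fit of dst ~ a * src + b over complex points.

    abs(a) is the scale factor and cmath.phase(a) the rotation angle.
    """
    n = len(src)
    src_mean = sum(src) / n
    dst_mean = sum(dst) / n
    num = sum((q - dst_mean) * (p - src_mean).conjugate() for p, q in zip(src, dst))
    den = sum(abs(p - src_mean) ** 2 for p in src)
    a = num / den
    return a, dst_mean - a * src_mean


def ransac_similarity(src, dst, threshold=3.0, iterations=200, seed=0):
    """Return ((a, b), inlier_indices), or (None, []) if no model was found."""
    rng = random.Random(seed)
    best = []
    for _ in range(iterations):
        i, j = rng.sample(range(len(src)), 2)
        if src[i] == src[j]:
            continue  # two identical points do not define a transform
        a, b = fit_similarity([src[i], src[j]], [dst[i], dst[j]])
        inliers = [k for k in range(len(src)) if abs(a * src[k] + b - dst[k]) < threshold]
        if len(inliers) > len(best):
            best = inliers
    if len(best) < 2:
        return None, []
    # Refit on all inliers of the best two-point model.
    return fit_similarity([src[k] for k in best], [dst[k] for k in best]), best


def orb_correspondences(img_a, img_b, ratio=0.75):
    """Match ORB keypoints between two grayscale images (NumPy arrays)."""
    import cv2  # imported lazily so the geometry helpers above run without OpenCV

    orb = cv2.ORB_create(nfeatures=1000)
    kp_a, des_a = orb.detectAndCompute(img_a, None)
    kp_b, des_b = orb.detectAndCompute(img_b, None)
    if des_a is None or des_b is None:
        return [], []
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
    src, dst = [], []
    for pair in matcher.knnMatch(des_a, des_b, k=2):
        # Ratio test: keep a match only if it clearly beats the runner-up.
        if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance:
            src.append(complex(*kp_a[pair[0].queryIdx].pt))
            dst.append(complex(*kp_b[pair[0].trainIdx].pt))
    return src, dst


def is_duplicate(img_a, img_b, min_inliers=15):
    """Call two images duplicates if enough matches agree on one transform."""
    src, dst = orb_correspondences(img_a, img_b)
    if len(src) < 2:
        return False
    _, inliers = ransac_similarity(src, dst)
    return len(inliers) >= min_inliers
```

Images would typically be loaded with cv2.imread(path, cv2.IMREAD_GRAYSCALE) before calling is_duplicate. If speed is not a concern, the RANSAC step can be swapped for cv2.findHomography(src_pts, dst_pts, cv2.RANSAC, 5.0) and the returned inlier mask counted instead. The inlier count is what separates a true duplicate from a different shot of the same subject, so min_inliers needs tuning on real data.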

Licensed under: CC-BY-SA with attribution
