How do people prove the correctness of Computer Vision methods?

https://stackoverflow.com/questions/10050300

29-05-2021
|

Question

I'd like to pose a few abstract questions about computer vision research. I haven't quite been able to answer these questions by searching the web and reading papers.

How does someone know whether a computer vision algorithm is correct?
How do we define "correct" in the context of computer vision?
Do formal proofs play a role in understanding the correctness of computer vision algorithms?

A bit of background: I'm about to start my PhD in Computer Science. I enjoy designing fast parallel algorithms and proving the correctness of these algorithms. I've also used OpenCV from some class projects, though I don't have much formal training in computer vision.

I've been approached by a potential thesis advisor who works on designing faster and more scalable algorithms for computer vision (e.g. fast image segmentation). I'm trying to understand the common practices in solving computer vision problems.

La solution

In practice, computer vision is more like an empirical science: You gather data, think of simple hypotheses that could explain some aspect of your data, then test those hypotheses. You usually don't have a clear definition of "correct" for high-level CV tasks like face recognition, so you can't prove correctness.

Low-level algorithms are a different matter, though: You usually have a clear, mathematical definition of "correct" here. For example if you'd invent an algorithm that can calculate a median filter or a morphological operation more efficiently than known algorithms or that can be parallelized better, you would of course have to prove it's correctness, just like any other algorithm.

It's also common to have certain requirements to a computer vision algorithm that can be formalized: For example, you might want your algorithm to be invariant to rotation and translation - these are properties that can be proven formally. It's also sometimes possible to create mathematical models of signal and noise, and design a filter that has the best possible signal to noise-ratio (IIRC the Wiener filter or the Canny edge detector were designed that way).

Many image processing/computer vision algorithms have some kind of "repeat until convergence" loop (e.g. snakes or Navier-Stokes inpainting and other PDE-based methods). You would at least try to prove that the algorithm converges for any input.

Autres conseils

You just don't prove them.

Instead of a formal proof, which is often impossible to do, you can test your algorithm on a set of testcases and compare the output with previously known algorithms or correct answers (for example when you recognize the text, you can generate a set of images where you know what the text says).

This is my personal opinion, so take it for what it's worth.

You can't prove the correctness of most of the Computer Vision methods right now. I consider most of the current methods some kind of "recipe" where ingredients are thrown down until the "result" is good enough. Can you prove that a brownie cake is correct?

It is a bit similar in a way to how machine learning evolved. At first, people did neural networks, but it was just a big "soup" that happened to work more or less. It worked sometimes, didn't on other cases, and no one really knew why. Then statistical learning (through Vapnik among others) kicked in, with some real mathematical backup. You could prove that you had the unique hyperplane that minimized a particular loss function, PCA gives you the closest matrix of fixed rank to a given matrix (considering the Frobenius norm I believe), etc...

Now, there are still a few things that are "correct" in computer vision, but they are pretty limited. What comes to my mind is the wavelet : they are the sparsest representation in an orthogonal basis of function. (i.e : the most compressed way to represent an approximation of an image with minimal error)

Computer Vision algorithms are not like theorems which you can prove, they usually try to interpret the image data into the terms which are more understandable to us humans. Like face recognition, motion detection, video surveillance etc. The exact correctness is not calculable, like in the case of image compression algorithms where you can easily find the result by the size of the images. The most common methods used to show the results in Computer Vision methods(especially classification problems) are the graphs of precision Vs recall, accuracy Vs false positives. These are measured on standard databases available on various sites. Usually the harsher you set the parameters for correct detection, the more false positives you generate. The typical practice is to choose the point from the graph according to your requirement of 'how many false positives are tolerable for the application'.

Licencié sous: CC-BY-SA avec attribution

Non affilié à StackOverflow