Question

I have read several papers on using graph cuts for 3D reconstruction and I have noticed that there seem to be two alternative approaches to posing this problem.

One approach is volumetric and describes a 3D region of voxels for which a graph cut is used to infer a binary labelling (contains object of interest or does not) for each voxel. Papers which take this approach include Multi-View Stereo via Volumetric Graph Cuts and Occlusion Robust Photo-Consistency and A Surface Reconstruction Using Global Graph Cut Optimization.

The second approach is 2D and seeks to label each pixel of a reference image with the depth of the 3D point that projects there. Papers which take this approach include Computing Visual Correspondence with Occlusions via Graph Cuts.

I want to understand the advantages/disadvantages of each method and which are the most significant when choosing which method to use. So far I understand that some advantages of the first approach are:

  1. It is a binary problem, so is solvable exactly with Max-Flow algorithms.
  2. Provides simple methods of modelling occlusion.

And some advantages of the second approach are:

  1. Smaller neighbor set for each node of the graph.
  2. Easier to model smoothness (but does it give better results?).

Additionally, I would be interested in which situations I would be better off choosing one representation or the other and why.

Was it helpful?

Solution

The most significant difference is the type of scenes the algorithms are typically used with, and the way they represent the 3D shape of the object.

Volumetric approaches perform best...

  1. with a large number of images...
  2. taken from different viewpoints, well distributed around the object,...
  3. of a more or less compact "object" (e.g. an artifact, in contrast, for example, to an outdoor scene observed by a vehicle camera).

Volumetric approaches are popular for reconstructing "objects" (especially artifacts). Given sufficient views (i.e. images), the algorithms give a complete volumetric (i.e. voxel) representation of the object's shape. This can be converted to a surface representation using Marching Cubes or similar method.

The second type of algorithms you identified are called stereo algorithms, and graph cuts are just one of many methods of solving such problems. Stereo is best...

  1. if you have only two images...
  2. with a fairly narrow baseline (i.e. distance between cameras)

Generalizations to more than two images (with narrow baselines) exist, but most of the literature deals with the binocular (i.e. two image) case. Some algorithms generalize more easily to more views than others.

Stereo algorithms only give you a depth map, i.e. an image with a depth value for each pixel. This does not allow you to look "around" the object. There are, however, 3D reconstruction systems that start with stereo on image pairs and combine the depth maps in order to get a representation of the complete object, which is a non-trivial problem of its own right. Interestingly, this is often approached using a volumetric representation as an intermediate step.

Stereo algorithms can be and are often used to for "scenes", e.g. the road observed by a pair of cameras in a vehicle, or people in a room for 3D video conferencen.

Some closing remarks

  • For both stereo and volumetric reconstruction, graph cuts are just one of several methods to solve the problem. Stereo, for example, can also be formulated as a continuous optimization problem, rather than a discrete one, which implies other optimization methods for its solution.
  • My answer contains a bunch of generalizations and simplifications. It is not meant to be a definitive treatment of the subject.

I don't necessarily agree that smoothness is easier in the stereo case. Why do you think so?

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top