Object classification with Kinect using cascaded classifiers

Question

Kinect itself cannot recognize objects. It will give you a dense depth map. Then you can use the depth features along with some simple features (in your case, maybe color features or gradient features would do the job). Those features you input to a classifier (SVM or Random Forest for example) to train the system. You use the trained model for testing on new samples.

Regarding Haar features, I think they could do the job but you would need a sufficiently large database of features. It all depends on what you want to detect. In the case of an apple and a coin, just color would suffice.

Refer this paper to get an idea how to perform human pose recognition using Kinect camera. You just have to pay attention to their depth features and their classifiers. Do not apply their approach directly. Your problem is simpler.

Edit: simple gradient orientations histogram

Gradient orientations can give you a coarse idea about the shape of the object (It is not a shape-feature to be specific, better shape features exist, but this one is extremely fast to calculate).

Code snippet:

%calculate gradient
[dx,dy] = gradient(double(img));
A = (atan(dy./(dx+eps))*180)/pi;   %eps added to avoid division by zero.

A will contain orientation for each pixel. Segment your original image according to the depth values. For a segment having similar depth values, calculate color histogram. Extract the pixel orientations corresponding to that region, call it A_r. calculate a 9-bin (you can have more bins. Nine bins mean each bin will contain 180/9=20 degrees) histogram. Concatenate the color features and the gradient histogram. Do this for sufficient number of leaves. Then you can give this to a classifier for training.

Edit: This is a reply to a comment below.

Regarding MaxDepth parameter in opencv_traincascade

The documentation says, "Maximal depth of a weak tree. A decent choice is 1, that is case of stumps". When you perform binary classification, it takes a form of:

if yourFeatureValue>=learntThresh
   class=1;
else
   class=0;
end

The above type of classifier which performs thresholding on a single feature value (a scalar) is called decision stumps. There is only one split between positive and negative class (therefore maxDepth is one). For example, it would work in following scenario. Imagine you have a 1-D feature:

f=[1 2 3 4 -1 -2 -3 -4]

First 4 are class 1, rest are class 0. Decision stumps would get 100% accuracy on this data by setting the threshold to zero. Now, imagine a complicated feature space such as:

f=[1 2 3 4 5 6 7 8 9 10 11 12];

First 4 and last 4 are class 1, rest are class 0. Here, you cannot get 100% classification by decision stumps. You need two thresholds/splits. Therefore, you can construct a tree with depth value 2. You will have 2^(2-1)=2 thresholds. For depth=3, you get 4 thresholds, for depth=4, you get 8 thresholds and so on. Here, I assume a tree with a single node has height 1.

You may feel that the more the number of levels, you can achieve more accuracy, but then there is a problem of overfitting (and computation, memory storage etc.). Therefore, you have to set a good value for depth. I usually set it to 3.