Is SIFT's main function finding similarity between 2 images, not grouping images?

Question

SIFT is an algorithm that describes a feature point so that the descriptor is invariant to image translation, scaling, and rotation, partially invariant to changes in illumination, and robust to local geometric distortion.

In simple words, you can think of SIFT as a way of generating a descriptor for a specific point in an image so that this descriptor won't change if the image is zoomed in, moved around or even rotated. As this descriptor is largely unaffected by image transformations, it can be used to compare features in different images that were taken under different conditions (different view, zoom, lighting).

If you want to compare 2 images, you don't need to do any training or create a knowledge base. Just extract feature points from the two images and compare their descriptors one by one. If two descriptors are the same (or almost the same), you can assume that they belong to the same object in the image. Problems start when there are repetitive patterns, because many different points then have nearly identical descriptors.
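To make the one-by-one comparison concrete, here is a minimal sketch in plain Python. It assumes the descriptors have already been extracted and uses made-up 4-dimensional vectors instead of real 128-dimensional SIFT descriptors. The function name `match_descriptors` and the `ratio` value of 0.75 are my own choices for illustration. The ratio test (accept a match only if the best candidate is clearly closer than the second best) is one common way to deal with the repetitive-pattern problem; it is not the only one.

```python
import math

def match_descriptors(desc_a, desc_b, ratio=0.75):
    """Return (i, j) pairs where desc_a[i] is matched to desc_b[j]."""
    matches = []
    for i, d in enumerate(desc_a):
        # Euclidean distance from d to every descriptor of the other image,
        # sorted so that the closest candidate comes first.
        ranked = sorted((math.dist(d, e), j) for j, e in enumerate(desc_b))
        if not ranked:
            continue
        best_dist, best_j = ranked[0]
        # Ratio test: keep the match only if the best candidate is clearly
        # closer than the runner-up. With a single candidate there is nothing
        # to compare against, so it is accepted as is.
        if len(ranked) == 1 or best_dist < ratio * ranked[1][0]:
            matches.append((i, best_j))
    return matches

# Toy descriptors (hypothetical values, 4-D for readability).
image_a = [(0.0, 0.0, 1.0, 0.0), (1.0, 0.0, 0.0, 0.0), (0.5, 0.5, 0.0, 0.0)]
# The last two descriptors of image_b are almost identical to each other,
# which imitates a repetitive pattern.
image_b = [(0.0, 0.1, 1.0, 0.0), (0.5, 0.45, 0.0, 0.0), (0.5, 0.55, 0.0, 0.0)]

matches = match_descriptors(image_a, image_b)
print(matches)
```

Only the first descriptor of `image_a` gets a match here. The third one has two equally good candidates in `image_b`, so it is rejected as ambiguous rather than matched to the wrong point.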

If you want to cluster/group images in some specific way, then you need some criteria by which to do that. That's when a knowledge base kicks in. For instance, if you would like to find images that contain human faces, you need a way to tell the computer what a human face looks like.

Of course, those algorithms are not 100% perfect and there are some weak points. For instance, if the image is changed/distorted too much, the descriptors start to differ.

UPDATED:

SIFT is just a method to generate a description of a specific feature in an image. It has nothing to do with image classification by itself.

Bag-of-Words

Bag-of-Words is just a method to simplify analysis of image content. The main idea is that we can compare two images just by comparing their distinct features and how often each one occurs. If both images contain roughly the same features, those images are considered to be similar or even equal. It does not matter where those features are located in the image. As SIFT descriptors are vectors with 128 dimensions, comparing them directly is expensive; Bag-of-Words greatly simplifies the process by assigning each descriptor to the closest entry of a fixed set of "visual words" and keeping only the count of each word. Bag-of-Words can be used both in grouping and in classification/recognition.
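As a rough sketch of the idea, the snippet below turns a list of descriptors into a bag-of-words histogram and compares two histograms. Everything here is a toy assumption: the descriptors are 2-D instead of 128-D, the three-word `vocabulary` is written by hand (in practice it is usually learned from many descriptors, for example with k-means), and cosine similarity is just one possible way to compare histograms.

```python
import math
from collections import Counter

def bow_histogram(descriptors, vocabulary):
    """Count how many descriptors fall nearest to each visual word."""
    counts = Counter(
        min(range(len(vocabulary)), key=lambda k: math.dist(d, vocabulary[k]))
        for d in descriptors
    )
    return [counts[k] for k in range(len(vocabulary))]

def cosine_similarity(h1, h2):
    """Cosine of the angle between two histograms (0.0 if either is empty)."""
    dot = sum(a * b for a, b in zip(h1, h2))
    norm = math.sqrt(sum(a * a for a in h1)) * math.sqrt(sum(b * b for b in h2))
    return dot / norm if norm else 0.0

# Hypothetical vocabulary of three visual words.
vocabulary = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

# image_2 contains roughly the same features as image_1, in a different order.
image_1 = [(0.1, 0.0), (0.9, 0.1), (1.0, 0.1), (0.1, 0.9)]
image_2 = [(0.0, 1.0), (1.1, 0.0), (0.0, 0.1), (0.9, 0.0)]
image_3 = [(0.0, 0.9), (0.1, 1.0), (0.0, 1.1), (0.1, 0.1)]

hist_1 = bow_histogram(image_1, vocabulary)
hist_2 = bow_histogram(image_2, vocabulary)
hist_3 = bow_histogram(image_3, vocabulary)

sim_12 = cosine_similarity(hist_1, hist_2)
sim_13 = cosine_similarity(hist_1, hist_3)
print(hist_1, hist_2, hist_3, sim_12, sim_13)
```

`image_1` and `image_2` end up with the same histogram even though their descriptors appear in a different order, which is exactly the "location does not matter" property.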

Knowledge base (training)

Whether you need a knowledge base or not depends entirely on how you do the clustering. If you don't use a knowledge base, then you can do general clustering using SIFT to group similar images together without knowing what is in them. If you want to do clustering by some specific feature, then you need a knowledge base.

Generally speaking, if you want to classify images into known groups, you use a knowledge base. If you want to group similar images together without knowing what each group will contain, you don't need a knowledge base.

Example:

Imagine you have 5 images - each of them contains one letter (A, B or C) and some background texture (wood, sand, cloth). The background texture takes up most of each image.

1. A - wood
2. B - cloth
3. C - wood
4. A - sand
5. C - cloth

1) Clustering without knowledge base (grouping)

If we did the clustering without a knowledge base, we would compare every pair of images to see how similar they are.

We would come up with the following:

Group 1 - 1., 3.
Group 2 - 2., 5.
Group 3 - 4.

You would not be able to tell what each group contains, but you would know that the images in each group are somehow similar. In this case they are most probably similar because they share the same background.
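Here is a small sketch of that grouping. The numbers are invented: each image is described by made-up visual-word counts for wood, sand, cloth, A, B and C, with the background getting most of the counts because it covers most of the image. The greedy grouping rule and the 0.8 threshold are my own simplifications; a real system would more likely use a proper clustering algorithm.

```python
import math

def cosine_similarity(h1, h2):
    """Cosine of the angle between two histograms (0.0 if either is empty)."""
    dot = sum(a * b for a, b in zip(h1, h2))
    norm = math.sqrt(sum(a * a for a in h1)) * math.sqrt(sum(b * b for b in h2))
    return dot / norm if norm else 0.0

def group_by_similarity(histograms, threshold=0.8):
    """Greedy grouping: an image joins the first group that already
    contains a sufficiently similar image, otherwise it starts a new group."""
    groups = []
    for name, hist in histograms.items():
        for group in groups:
            if any(cosine_similarity(hist, histograms[other]) >= threshold
                   for other in group):
                group.append(name)
                break
        else:
            groups.append([name])
    return groups

# Hypothetical visual-word counts per image.
# Columns: wood, sand, cloth, A, B, C
histograms = {
    1: [8, 0, 0, 2, 0, 0],  # A - wood
    2: [0, 0, 8, 0, 2, 0],  # B - cloth
    3: [8, 0, 0, 0, 0, 2],  # C - wood
    4: [0, 8, 0, 2, 0, 0],  # A - sand
    5: [0, 0, 8, 0, 0, 2],  # C - cloth
}

groups = group_by_similarity(histograms)
print(groups)
```

No labels are involved anywhere; the groups fall out of the pairwise similarities alone, and the dominant background decides them.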

2) Clustering with knowledge base (classification/recognition)

Now imagine we had a knowledge base that contained a lot of images of each letter. Instead of comparing every pair of images, we could compare each input image to the knowledge base to determine which letter it is most similar to.

Then you would come up with the following:

Group A - 1., 4.
Group B - 2.
Group C - 3., 5.

In this case we know what each group contains, as we have used a knowledge base.
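The classification case can be sketched the same way. Again, all numbers are invented: the images are made-up visual-word counts (wood, sand, cloth, A, B, C), and the `knowledge_base` holds a couple of hypothetical reference histograms per letter in which the letter dominates. Picking the label of the single most similar reference (nearest neighbour) is only one simple option; the paper mentioned below may well do it differently.

```python
import math

def cosine_similarity(h1, h2):
    """Cosine of the angle between two histograms (0.0 if either is empty)."""
    dot = sum(a * b for a, b in zip(h1, h2))
    norm = math.sqrt(sum(a * a for a in h1)) * math.sqrt(sum(b * b for b in h2))
    return dot / norm if norm else 0.0

def classify(hist, knowledge_base):
    """Return the label of the reference histogram most similar to hist."""
    best_label, best_score = None, -1.0
    for label, references in knowledge_base.items():
        for ref in references:
            score = cosine_similarity(hist, ref)
            if score > best_score:
                best_label, best_score = label, score
    return best_label

# Hypothetical knowledge base: labelled reference histograms per letter.
# Columns: wood, sand, cloth, A, B, C
knowledge_base = {
    "A": [[0, 0, 0, 5, 0, 0], [1, 1, 1, 5, 0, 0]],
    "B": [[0, 0, 0, 0, 5, 0], [1, 1, 1, 0, 5, 0]],
    "C": [[0, 0, 0, 0, 0, 5], [1, 1, 1, 0, 0, 5]],
}

# The same five toy images as in the example (background dominates).
images = {
    1: [8, 0, 0, 2, 0, 0],  # A - wood
    2: [0, 0, 8, 0, 2, 0],  # B - cloth
    3: [8, 0, 0, 0, 0, 2],  # C - wood
    4: [0, 8, 0, 2, 0, 0],  # A - sand
    5: [0, 0, 8, 0, 0, 2],  # C - cloth
}

labels = {name: classify(hist, knowledge_base) for name, hist in images.items()}
print(labels)
```

Because the references tell the computer what each letter looks like, the images are now sorted by letter and not by background.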

All that said, here is a paper on how object classification is done. In this paper SURF is used instead of SIFT, but that does not change the main idea.

PS. I am sorry if I have oversimplified something, but I hope it makes this easier to understand.