I think it is in order to compare images with the same number of features.
If you extract "standard" SIFT, you don't know in advance how many interest points you will obtain. So comparing 2 images with a different number of features (a different number of points) gets complicated: you can't directly use an SVM or a neural network, because the number of features has to be the same for every image.
With standard SIFT you need to either match the points, find the inliers and do some further processing, or compute a Bag of Visual Words, before you can compute the similarity between two images.
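To illustrate the Bag of Visual Words option, here is a minimal sketch in plain Python. It is not a real pipeline: the descriptors are random stand-ins for SIFT output, the codebook is random too (normally you would learn it, for example with k-means on training descriptors), and the names `bovw_histogram` and `histogram_intersection` are just ones I made up for the example. The point is only that two images with different numbers of descriptors end up with vectors of the same length.

```python
import math
import random

DIM = 128  # length of one SIFT descriptor


def nearest_word(descriptor, codebook):
    """Index of the codebook entry closest (Euclidean) to the descriptor."""
    return min(range(len(codebook)),
               key=lambda k: math.dist(descriptor, codebook[k]))


def bovw_histogram(descriptors, codebook):
    """Map a variable-size set of descriptors to a fixed-length,
    L1-normalised histogram of visual-word counts."""
    hist = [0.0] * len(codebook)
    for d in descriptors:
        hist[nearest_word(d, codebook)] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total else hist


def histogram_intersection(h1, h2):
    """Similarity in [0, 1] between two L1-normalised histograms."""
    return sum(min(a, b) for a, b in zip(h1, h2))


rng = random.Random(0)


def random_descriptors(n):
    # Stand-in for real SIFT output: n random vectors of length DIM.
    return [[rng.random() for _ in range(DIM)] for _ in range(n)]


# Assumption: a tiny random codebook of 8 "visual words".
codebook = random_descriptors(8)

# Two "images" with different numbers of interest points.
image_a = random_descriptors(37)
image_b = random_descriptors(112)

hist_a = bovw_histogram(image_a, codebook)
hist_b = bovw_histogram(image_b, codebook)

# Both histograms have len(codebook) entries, whatever the point count.
print(len(hist_a), len(hist_b))
print(histogram_intersection(hist_a, hist_b))
```

Once every image is a vector of the same length like this, you can feed it to an SVM or compare two images directly with a histogram distance.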
If you just want to know how SIFT works, you can check Wikipedia and David Lowe's papers.