The title only mentions BoW, but in your text you also use SVMs.
I think the core idea of CBIR is to find the most similar image according to some distance measure. You can do this with BoW features alone; the SVM is not necessary.
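To make that concrete, here is a minimal sketch of retrieval by distance only. It assumes each image has already been turned into a BoW histogram; the image names, the 4-word vocabulary, the histogram values and the choice of Euclidean distance are all made up for illustration (a chi-squared or histogram-intersection measure would slot in the same way).

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length BoW histograms."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def retrieve(query_hist, database, distance=euclidean):
    """Return the name of the database image closest to the query.

    database: dict mapping image name -> BoW histogram (list of floats).
    """
    return min(database, key=lambda name: distance(query_hist, database[name]))

# Toy histograms over a hypothetical 4-word visual vocabulary.
database = {
    "beach_01": [0.7, 0.1, 0.1, 0.1],
    "forest_01": [0.1, 0.7, 0.1, 0.1],
    "city_01": [0.1, 0.1, 0.4, 0.4],
}
query = [0.6, 0.2, 0.1, 0.1]

best = retrieve(query, database)
print(best)
```

Note that this compares the query against every image in the database, which is exactly the cost the classification step is meant to cut down.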
The main purpose of the additional classification step is to speed up the search: once you have a class label for your test image, you only need to search that subgroup of your images for the best match. And of course, if the SVM is better at distinguishing certain classes than your distance measure, it might also help to reduce errors.
So the standard workflow would be:
- obtain the class of the test image (e.g. with the SVM)
- return the best match from the training samples of this class
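A rough sketch of these two steps, under some assumptions: `classify` is a hypothetical callable standing in for your trained SVM (anything that maps a BoW histogram to a class label), and the training set, labels and histogram values are invented toy data.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length BoW histograms."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def retrieve_in_class(query_hist, training_set, classify, distance=euclidean):
    """Two-stage retrieval: classify first, then search only that class.

    training_set: list of (name, label, histogram) tuples.
    classify: callable histogram -> label (placeholder for the SVM).
    """
    # Step 1: obtain the class of the query image.
    label = classify(query_hist)
    # Step 2: restrict the search to training samples of that class.
    candidates = [(name, hist) for name, lbl, hist in training_set if lbl == label]
    if not candidates:
        raise ValueError("no training samples with label %r" % (label,))
    best_name, _ = min(candidates, key=lambda item: distance(query_hist, item[1]))
    return label, best_name

# Toy data: names, labels and histograms are made up.
training_set = [
    ("beach_01", "beach", [0.7, 0.1, 0.1, 0.1]),
    ("beach_02", "beach", [0.5, 0.3, 0.1, 0.1]),
    ("forest_01", "forest", [0.1, 0.7, 0.1, 0.1]),
    ("forest_02", "forest", [0.2, 0.6, 0.1, 0.1]),
]

def toy_classifier(hist):
    # Stand-in for a trained SVM's prediction; not a real classifier.
    return "beach" if hist[0] >= hist[1] else "forest"

label, match = retrieve_in_class([0.55, 0.25, 0.1, 0.1], training_set, toy_classifier)
print(label, match)
```

Only the samples of the predicted class are compared against the query, so the distance computations shrink roughly in proportion to the class size. The flip side is that a wrong class label makes the correct match unreachable.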