This is actually a big research problem. You are correct: averaging all the descriptors will not be meaningful. There are several approaches for creating a single vector out of a set of local descriptors. One big class of methods is called "bag of features" or "bag of visual words". The general idea is to cluster local descriptors (e.g. SIFT) from many images (e.g. using k-means). Then you take a particular image, figure out which cluster each of its descriptors belongs to, and build a histogram of the cluster counts. That histogram is the fixed-length vector for the image. There are different ways of doing the clustering and different ways of creating and normalizing the histogram.
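To make the bag-of-visual-words idea concrete, here is a minimal sketch in Python with NumPy. It is only an illustration under some assumptions: the descriptors are random 128-dimensional arrays standing in for real SIFT output, the clustering is a bare-bones Lloyd's k-means, the histogram is L1-normalized (one of several possible choices), and the helper names `kmeans`, `assign`, and `bow_histogram` are made up for this example.

```python
import numpy as np


def assign(descriptors, centers):
    """Return the index of the nearest cluster center for each descriptor."""
    # Squared Euclidean distances, shape (n_descriptors, n_centers).
    d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)


def kmeans(points, k, n_iter=20, seed=0):
    """Bare-bones Lloyd's k-means; returns a (k, d) array of cluster centers."""
    rng = np.random.default_rng(seed)
    # Initialize the centers with k distinct points picked at random.
    centers = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        labels = assign(points, centers)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return centers


def bow_histogram(descriptors, centers):
    """Turn one image's descriptors into a fixed-length, L1-normalized histogram."""
    labels = assign(descriptors, centers)
    hist = np.bincount(labels, minlength=len(centers)).astype(float)
    total = hist.sum()
    return hist / total if total > 0 else hist


# Hypothetical data: three "images" with different numbers of 128-D descriptors.
rng = np.random.default_rng(0)
images = [rng.random((n, 128)) for n in (50, 80, 65)]

# Build the visual vocabulary from the descriptors of all images.
vocab = kmeans(np.vstack(images), k=10)

# Each image becomes one vector of length k, whatever its descriptor count.
vectors = np.array([bow_histogram(d, vocab) for d in images])
print(vectors.shape)  # (3, 10)
```

Each row of `vectors` has the same length no matter how many descriptors the image had, so it can go straight into a standard classifier or a nearest-neighbor search. In practice you would cluster a large sample of real descriptors and use a much bigger vocabulary than 10 words.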
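A somewhat different approach is called "Pyramid Match Kernel". It is a kernel function that measures the similarity between two sets of local descriptors, which lets you train an SVM classifier directly on the sets without first converting each one into a single vector.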
So for starters, google "bag of features" or "bag of visual words".