Question

I am trying to decide which particular algorithm would be most appropriate for my use-case.

I have a dataset of about 1000 physical buildings in a city, with features such as location, distance, year built, and other characteristics. For each new data point (a building), I'd like to find the 3-5 buildings that are most similar, based on a comparison of their features.

I define similarity as a weighted comparison of features. I'd like to iterate over the entire feature space (with a filter, e.g. on location) and choose the 3-5 buildings most similar to the new data point.

Here's what my data looks like:

(sample data table not reproduced)

I'm wondering what similarity measure would make sense here? I work in Python, so I'd prefer a pythonic/scikit-learn way of doing this.


The solution

It appears to me that what you're looking for in your use-case is not clustering - it's a distance metric.

When you get a new data point, you want to find the 3-5 most similar existing data points; there's no need for clustering here. Calculate the distance from the new data point to each of the 'old' data points, and select the top 3-5.
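This top-k selection can be sketched directly with NumPy. The feature vectors and weights below are invented for illustration (the question mentions weighted comparison, so a weighted Euclidean distance is assumed here):

```python
import numpy as np

# Hypothetical example: each row is a building's (already normalized) feature vector.
buildings = np.array([
    [0.2, 0.5, 0.1],
    [0.9, 0.3, 0.4],
    [0.1, 0.6, 0.2],
    [0.8, 0.8, 0.7],
])

new_building = np.array([0.15, 0.55, 0.2])

# Weighted Euclidean distance: scale each squared feature difference
# by an assumed importance weight before summing.
weights = np.array([1.0, 2.0, 0.5])
dists = np.sqrt(((buildings - new_building) ** 2 * weights).sum(axis=1))

# Indices of the 3 closest buildings, nearest first.
top3 = np.argsort(dists)[:3]
print(top3)  # e.g. array([2, 0, 3]) for these made-up values
```

With only ~1000 buildings, this brute-force scan is cheap; a location filter can simply be applied to `buildings` before computing distances.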

Now, which distance metric should you pick? There are several options. If you're using scikit-learn, I'd look over this page for examples of distance (and similarity) metrics.

If your features are continuous, you can normalize them and use cosine similarity. Start with this and see if it fits.
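A minimal scikit-learn sketch of that suggestion: standardize the features, then query a `NearestNeighbors` index with the cosine metric. The building features below (latitude, longitude, year built, floor area) are made up for illustration:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors

# Hypothetical features: [latitude, longitude, year_built, floor_area]
X = np.array([
    [40.71, -74.00, 1960, 5000.0],
    [40.72, -74.01, 1995, 7500.0],
    [40.70, -73.99, 1930, 3200.0],
    [40.73, -74.02, 2005, 9100.0],
    [40.71, -74.00, 1958, 5100.0],
])

# Normalize so no single feature (e.g. year built) dominates the comparison.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Cosine distance; with a dataset this small, brute-force search is fine.
nn = NearestNeighbors(n_neighbors=3, metric="cosine")
nn.fit(X_scaled)

new_building = np.array([[40.71, -74.00, 1962, 5050.0]])
dists, idx = nn.kneighbors(scaler.transform(new_building))
print(idx[0])  # indices of the 3 most similar buildings, most similar first
```

Feature weighting can be added by multiplying the scaled columns by a weight vector before fitting, since cosine similarity on `X_scaled * w` is equivalent to a weighted comparison.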

Licensed under: CC-BY-SA with attribution