Question

What is the right approach and clustering algorithm for geolocation clustering?

I'm using the following code to cluster geolocation coordinates:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.vq import kmeans2, whiten

coordinates = np.array([
    [lat, long],
    [lat, long],
    ...
    [lat, long]
])
# kmeans2 returns (centroids, labels)
centroids, labels = kmeans2(whiten(coordinates), 3, iter=20)
plt.scatter(coordinates[:, 0], coordinates[:, 1], c=labels)
plt.show()

Is it right to use K-means for geolocation clustering, given that it uses Euclidean distance rather than the Haversine formula as its distance function?


Solution

K-means should be right in this case. Since k-means groups objects based solely on Euclidean distance between them, you will get back clusters of locations that are close to each other.

To find the optimal number of clusters you can try making an 'elbow' plot of the within-cluster sum of squared distances. This notebook may be helpful: http://nbviewer.ipython.org/github/nborwankar/LearnDataScience/blob/master/notebooks/D3.%20K-Means%20Clustering%20Analysis.ipynb
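A minimal sketch of such an elbow plot, using synthetic coordinates (the three city-like centres and spread below are made-up illustration data) and SciPy's `kmeans`, which conveniently returns the mean distortion:

```python
import numpy as np
from scipy.cluster.vq import kmeans, whiten

# Hypothetical sample: lat/lon points scattered around three centres.
rng = np.random.default_rng(0)
centers = np.array([[40.7, -74.0], [51.5, -0.1], [35.7, 139.7]])
coords = np.vstack([c + rng.normal(scale=0.3, size=(50, 2)) for c in centers])

w = whiten(coords)
distortions = []
for k in range(1, 8):
    # scipy's kmeans returns (codebook, mean distortion)
    _, distortion = kmeans(w, k, seed=1)
    distortions.append(distortion)

# Plot range(1, 8) against distortions and look for the "elbow":
# the k after which the curve flattens out (here, around k = 3).
```

The distortion drops sharply up to the true number of clusters and only marginally afterwards; that kink is the elbow.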

OTHER TIPS

K-means is not the most appropriate algorithm here.

The reason is that k-means is designed to minimize variance. This is, of course, appealing from a statistical and signal-processing point of view, but your data is not "linear".

Since your data is in latitude, longitude format, you should use an algorithm that can handle arbitrary distance functions, in particular geodetic distance functions. Hierarchical clustering, PAM, CLARA, and DBSCAN are popular examples of this.
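As a sketch of the arbitrary-distance-function idea, hierarchical clustering can be run on a precomputed geodesic distance matrix. The haversine helper and the sample coordinates below are illustrative assumptions, not part of the original answer:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

def haversine(p, q, r=6371.0):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (p[0], p[1], q[0], q[1]))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * r * np.arcsin(np.sqrt(a))

# Hypothetical sample: three tight pairs of points on different continents.
coords = np.array([[40.7, -74.0], [40.8, -73.9],   # New York area
                   [51.5, -0.1],  [51.6,  0.0],    # London area
                   [35.7, 139.7], [35.6, 139.8]])  # Tokyo area

d = pdist(coords, metric=haversine)    # condensed geodesic distance matrix
Z = linkage(d, method='average')       # hierarchical clustering on those distances
labels = fcluster(Z, t=3, criterion='maxclust')
# labels puts each geographic pair in its own cluster (up to relabeling)
```

Because the algorithm only ever looks at the precomputed distances, the Euclidean-vs-haversine question disappears entirely.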

The talk at https://www.youtube.com/watch?v=QsGOoWdqaT8 recommends OPTICS clustering.

The problems of k-means are easy to see when you consider points close to the ±180 degrees wrap-around. Even if you hacked k-means to use Haversine distance, the mean it recomputes in the update step will be badly distorted. In the worst case, k-means will never converge!
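The wrap-around issue is easy to demonstrate: the naive arithmetic mean of two longitudes just either side of the ±180° line lands on the opposite side of the globe, while a circular mean (averaging the points on the unit circle) does not. A small sketch:

```python
import numpy as np

# Two points about 0.4 degrees apart, straddling the antimeridian:
lons = np.array([179.8, -179.8])

naive_mean = lons.mean()
# naive_mean is 0.0 -- the Greenwich meridian, the wrong side of the Earth,
# which is exactly what a plain k-means update step would compute.

# A circular mean averages unit vectors instead:
rads = np.radians(lons)
circ_mean = np.degrees(np.arctan2(np.sin(rads).mean(), np.cos(rads).mean()))
# circ_mean is approximately +/-180.0, where the points actually are.
```

This is why a mean-based update step needs more than just a swapped-in distance function to handle geographic coordinates.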

GPS coordinates can be converted directly to a geohash. A geohash divides the Earth into "buckets" whose size depends on the number of characters: short codes cover large areas, and longer codes cover smaller ones. This makes geohashing an adjustable-precision clustering method.
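A pure-Python sketch of the standard geohash encoding (the precision and sample coordinates are arbitrary choices for illustration). Because a shorter geohash is by construction a prefix of a longer one, truncating hashes to k characters is itself the bucketing step:

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat, lon, precision=7):
    """Encode a lat/lon pair as a geohash string (standard algorithm)."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    bits, even = [], True  # geohash interleaves bits, starting with longitude
    while len(bits) < precision * 5:
        if even:  # refine the longitude interval
            mid = (lon_lo + lon_hi) / 2
            bits.append(1 if lon >= mid else 0)
            lon_lo, lon_hi = (mid, lon_hi) if lon >= mid else (lon_lo, mid)
        else:     # refine the latitude interval
            mid = (lat_lo + lat_hi) / 2
            bits.append(1 if lat >= mid else 0)
            lat_lo, lat_hi = (mid, lat_hi) if lat >= mid else (lat_lo, mid)
        even = not even
    # pack each 5-bit group into one base-32 character
    return "".join(
        BASE32[int("".join(map(str, bits[i:i + 5])), 2)]
        for i in range(0, len(bits), 5)
    )

geohash_encode(57.64911, 10.40744, precision=11)  # 'u4pruydqqvj'
geohash_encode(57.64911, 10.40744, precision=5)   # 'u4pru' (a coarser bucket)
```

Grouping points by a shared prefix then gives fixed-grid clusters; note the usual caveat that two very close points can still fall into adjacent buckets at a grid boundary.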

I am probably very late with my answer, but if you are still dealing with geo clustering, you may find this study interesting. It compares two fairly different approaches to classifying geographic data: K-means clustering and latent class growth modeling (LCGM).

One of the images from the study: [figure omitted]

The authors concluded that the end results were overall similar, and that there were some aspects where LCGM outperformed K-means.

You can use HDBSCAN for this. The Python package supports haversine distance, which will properly compute distances between lat/lon points.

As the docs mention, you will need to convert your points to radians first for this to work. The following snippet should do the trick:

import numpy as np
import hdbscan

points = np.array([[lat1, lon1], [lat2, lon2], ...])
rads = np.radians(points)  # the haversine metric expects radians
clusterer = hdbscan.HDBSCAN(min_cluster_size=N, metric='haversine')
cluster_labels = clusterer.fit_predict(rads)

Using the k-means algorithm to cluster the locations is a bad idea. Your locations can be spread across the world, and you cannot predict the number of clusters in advance; moreover, if you set the number of clusters to 1, all locations collapse into a single cluster. I am using hierarchical clustering for the same task.

Java's Apache commons-math does this pretty easily.

https://commons.apache.org/proper/commons-math/javadocs/api-3.1/org/apache/commons/math3/stat/clustering/DBSCANClusterer.html

List<Cluster<T>> cluster(Collection<T> points)

Go with K-means clustering, as HDBSCAN can take forever. I tried it on one of my projects and ended up using K-means, with the desired results.

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange