Question

I have a dataset of points:

 lat   | long    | time
 34.53 | -126.34 | 1
 34.52 | -126.32 | 2
 34.51 | -126.31 | 3
 34.54 | -126.36 | 4
 34.59 | -126.28 | 5
 34.63 | -126.14 | 6
 34.70 | -126.05 | 7
 ...

(Much larger dataset, but this is the general structure.)

I want to cluster points based on distance and time. DBSCAN seems like a good choice, since I don't know how many clusters there are.

Currently I am using minute/5500 (which is approximately 20 meters, scaled, I believe):

library(fpc)
 results <- dbscan(data, MinPts = 3, eps = 0.00045, method = "raw", scale = FALSE, showplot = 1)

I am having trouble understanding how the scaling / distance is determined, since I have raw data. I can guess at values for eps when scaled or unscaled, but I am unclear on what the scaling does, or what distance metric is being used (Euclidean distance, perhaps?). Is there documentation on this somewhere?

(This is not about finding an automated way to choose these values, as in Choosing eps and minpts for DBSCAN (R)?, but about what the different values mean. Saying "you need a distance function first" doesn't explain what distance function is being used, or how to create one...)


Solution

First calculate the distance matrix of your data. Then, instead of method="raw", use method="dist". That way, dbscan will treat your input as a distance matrix, so you do not need to worry about how the distance function is implemented. Note that this may require more memory, since you are pre-computing the whole distance matrix and storing it in memory.
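For example, a minimal sketch of that approach (the column names and the choice of plain Euclidean distance are my assumptions for illustration, not part of the original answer):

 library(fpc)

 # Pre-compute pairwise distances on the raw lat/long columns;
 # here dist() gives Euclidean distances in degrees.
 d <- dist(data[, c("lat", "long")])

 # With method = "dist", dbscan() interprets its first argument as distances,
 # so eps is expressed in the same units as d.
 results <- dbscan(d, eps = 0.00045, MinPts = 3, method = "dist")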

OTHER TIPS

I don't use R/fpc but ELKI, so I can't really answer the fpc-specific part of your question. I switched because I have found ELKI to be substantially faster than fpc, in particular when it can use indexes. When you work with data sets in the millions of points, the difference is huge.

Furthermore, it's very flexible, and that seems to be what you need:

ELKI has a LatLng distance function that uses the great-circle distance, so epsilon can be set directly in kilometres.

However, you also have a time attribute. Do you plan to include it in your analysis? ELKI has a tutorial on writing custom distance functions, which is probably what you need in that case. You should be able to reuse the great-circle distance, and here is a neat trick with DBSCAN:

DBSCAN doesn't really need the distances. It needs to know the neighbors; the distances are only used for comparison to epsilon. So by defining a distance function that is 0 when two objects should be similar and 1 when they should be different, along with an epsilon of 0.5, you can do much more complex clusterings. In your context, you could define your distance function as follows (sketched in code after the list):

0 if the distance is less than 0.1 km and the time difference is also less than t
1 otherwise
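
A hedged sketch of that 0/1 trick, done in R with fpc's pre-computed-distance mode rather than ELKI (the geosphere package for great-circle distance, the column names, and the concrete time threshold are my assumptions, not part of the answer):

 library(fpc)
 library(geosphere)   # distHaversine() gives great-circle distance in metres

 t_max <- 2   # hypothetical value for the time threshold "t"

 # Pairwise great-circle distances in km (geosphere expects long, lat order)
 # and pairwise absolute time differences.
 km <- distm(data[, c("long", "lat")], fun = distHaversine) / 1000
 dt <- abs(outer(data$time, data$time, "-"))

 # Binary "distance": 0 when both thresholds are met, 1 otherwise.
 binary_d <- ifelse(km < 0.1 & dt < t_max, 0, 1)

 # With eps = 0.5, points at distance 0 are neighbors and points at 1 are not.
 results <- dbscan(binary_d, eps = 0.5, MinPts = 3, method = "dist")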
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow