Question

I've been running scikit-learn's DBSCAN implementation to cluster a set of geotagged photos by lat/long. For the most part it works pretty well, but I came across a few instances that were puzzling. For instance, there were two sets of photos whose user-entered text field specified that the photo was taken at Central Park, but the lat/longs for those photos were not clustered together. The photos themselves confirmed that both sets of observations were from Central Park, but the lat/longs were in fact further apart than epsilon.
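Roughly, this is what I'm doing (a simplified sketch; the coordinates, eps, and min_samples below are placeholders, not my real data):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical photo locations as [lat, lon] in degrees
photos_latlon = np.array([[40.7812, -73.9665],   # Central Park
                          [40.7794, -73.9632],   # Central Park
                          [40.7580, -73.9855]])  # Times Square

# Haversine expects radians; eps is metres divided by the Earth's radius
earth_radius_m = 6_371_000
eps_m = 1000

db = DBSCAN(eps=eps_m / earth_radius_m, min_samples=2,
            metric="haversine").fit(np.radians(photos_latlon))
print(db.labels_)  # e.g. [0, 0, -1]
```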

After a little investigation, I discovered that the reason for this is that the lat/long geotags (which were generated from the phone's GPS) are pretty imprecise. When I looked at the location accuracy of each photo, I found that it varied widely (I've seen a margin of error of up to 600 meters), and that once you take the location accuracy into account, these two sets of photos could well be within epsilon of each other.

Is there any way to account for margin of error in lat/long when you're doing DBSCAN?

(Note: I'm not sure if this question is as articulate as it should be, so if there's anything I can do to make it more clear, please let me know.)

Solution

Note that DBSCAN doesn't actually need the distances.

Look up Generalized DBSCAN: all it really uses is an "is a neighbor of" relationship.
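For example, here is one way to bake the uncertainty into that neighbor relationship (a sketch, not an existing DBSCAN extension; the per-photo accuracy values and thresholds are assumptions): shrink each pairwise distance by the two photos' reported GPS accuracies and hand the result to scikit-learn's DBSCAN as a precomputed matrix.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics.pairwise import haversine_distances

# Hypothetical data: [lat, lon] in degrees plus a per-photo GPS accuracy in metres
coords = np.array([[40.7812, -73.9665],
                   [40.7794, -73.9632],
                   [40.7580, -73.9855]])
accuracy_m = np.array([600.0, 50.0, 10.0])

earth_radius_m = 6_371_000
dist_m = haversine_distances(np.radians(coords)) * earth_radius_m

# Treat two photos as neighbors if their error circles could bring them within
# eps of each other: subtract both accuracies from the raw distance.
adjusted = np.clip(dist_m - accuracy_m[:, None] - accuracy_m[None, :], 0.0, None)

db = DBSCAN(eps=1000, min_samples=2, metric="precomputed").fit(adjusted)
print(db.labels_)
```

The adjusted values are no longer a proper metric, but that is fine here: as noted above, DBSCAN only needs the neighborhood test, not true distances.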

If you really need to incorporate uncertainty, look up the various DBSCAN variations and extensions that handle imprecise data explicitly. However, you may get pretty much the same results just by choosing a reasonable threshold for epsilon. There is room for choosing a larger epsilon than the one you deem adequate: if you want to use epsilon = 1 km, and you assume your data is imprecise on the order of 100 m, then use 1100 m as epsilon instead.
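In scikit-learn that rule of thumb might look like this (a minimal sketch; the coordinates and min_samples are assumptions):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical photo locations as [lat, lon] in degrees
photos_latlon = np.array([[40.7812, -73.9665],
                          [40.7794, -73.9632],
                          [40.7580, -73.9855]])

earth_radius_m = 6_371_000
intended_eps_m = 1000   # the neighborhood radius you actually want
gps_error_m = 100       # assumed worst-case GPS imprecision

# Inflate eps by the error margin so borderline pairs still count as neighbors
db = DBSCAN(eps=(intended_eps_m + gps_error_m) / earth_radius_m,
            min_samples=2, metric="haversine").fit(np.radians(photos_latlon))
print(db.labels_)
```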

Licensed under: CC-BY-SA with attribution