MS SQL - Is using geometry data type to find distance significantly faster?

https://stackoverflow.com/questions/3968754

09-10-2019

Question

I have a database which contains a lot of geospatial data: basically information on tens of thousands of people, with coordinates for each of them.

The coordinates are currently stored as two floats for latitude and longitude, and I use a function to determine the distance between the coordinates in that record and a coordinate I pass in, basically to sort and limit the results I get by distance. This is roughly the code used in the function:

DECLARE @earthSphereRadiusKilometers as float
DECLARE @kilometerConversionToMilesFactor as float
SELECT @earthSphereRadiusKilometers = 6366.707019
SELECT @kilometerConversionToMilesFactor = .621371

-- convert degrees to radians
DECLARE @lat1Radians float
DECLARE @lon1Radians float
DECLARE @lat2Radians float
DECLARE @lon2Radians float
SELECT @lat1Radians = (@lat1Degrees / 180) * PI()
SELECT @lon1Radians = (@lon1Degrees / 180) * PI()
SELECT @lat2Radians = (@lat2Degrees / 180) * PI()
SELECT @lon2Radians = (@lon2Degrees / 180) * PI()

-- formula for distance from [lat1,lon1] to [lat2,lon2]
RETURN ROUND(2 * ASIN(SQRT(POWER(SIN((@lat1Radians - @lat2Radians) / 2) ,2) + COS(@lat1Radians) * COS(@lat2Radians) * POWER(SIN((@lon1Radians - @lon2Radians) / 2), 2))) * (@earthSphereRadiusKilometers * @kilometerConversionToMilesFactor), 4)

The stored procedure is taking 4 or 5 seconds to run.

I've noticed that SQL Azure now supports the geometry data type (it didn't when I created the database).

So my question is: would I experience a significant increase in the speed of my stored procedure, enough to make it worthwhile investing the time it would take to change things over to using the geometry data type?

Thanks!

Steven

Solution

I can't give you the yes/no answer you are looking for, because I also have no experience with using the new spatial datatypes.

But what I can give you are some pointers:

First off: Your SP seems to just convert some geographical data. SQL Server 2008 has methods to do just that for you with the new geography datatype. Look at the "OGC Methods on Geography Instances" section of the MSDN geography Data Type reference. So the new methods would at least give you the benefit of encapsulation.
Especially interesting for you should be the STDistance method (see "STDistance (geography Data Type)" on MSDN), because it seems that this is what your SP is actually doing: calculating the distance from lat1, lon1 to lat2, lon2. I believe a built-in function is faster than a self-created function, but I wouldn't know without testing.
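
For illustration, here is a minimal sketch of what an STDistance call looks like. The two coordinate pairs are made up for the example, and SRID 4326 (WGS84) is an assumption about how the latitude/longitude values were captured.

```sql
-- Hypothetical coordinates, chosen only for illustration.
-- geography::Point takes latitude, longitude, SRID (in that order).
DECLARE @a geography = geography::Point(51.5074, -0.1278, 4326);
DECLARE @b geography = geography::Point(48.8566, 2.3522, 4326);

-- For SRID 4326, STDistance returns the distance in meters.
SELECT @a.STDistance(@b) / 1000.0   AS DistanceKm,
       @a.STDistance(@b) / 1609.344 AS DistanceMiles;
```

Because the result is in meters for this SRID, it has to be converted if the existing function returns miles.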

In MS buzzword terms, the big plus of the spatial datatypes is spatial indexes. If you have a database with a lot of spatial data (which you do, even though the SP itself just converts some parameters), spatial indexes will bring you a performance increase. Or, quoting from the spatial data whitepaper:

Performance of queries against spatial data is further enhanced by the inclusion of spatial index support in SQL Server 2008. You can index spatial data with an adaptive multi-level grid index that is integrated into the SQL Server database engine.

And then there are some articles suggesting the better performance of spatially indexed (is that a word?) data against normal indexes:

Performance is certainly enhanced... (from SQL Server 2008 Spatial Index Performance)

And then there is a nice graph comparing the performance of different ways of storing spatial data: SQL Server 2008 Spatial - Performance of database calls?

So, to sum this up: Using a spatial index WILL give you a performance increase. Whether using the pre-defined spatial methods will give you a significant performance increase, I don't know.
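
As a rough sketch, this is what moving to an indexed geography column could look like. The table dbo.People and its columns PersonId, Latitude and Longitude are hypothetical names (not taken from the question), SRID 4326 is assumed, and the 50 km radius is arbitrary. It is untested against the asker's data, so treat it as a starting point only.

```sql
-- Hypothetical schema: dbo.People(PersonId, Latitude float, Longitude float).
-- GO is the batch separator used by SSMS/sqlcmd; the new column must exist
-- before a later batch can reference it.
ALTER TABLE dbo.People ADD Location geography NULL;
GO

-- Build a geography point from the existing float columns.
UPDATE dbo.People
SET Location = geography::Point(Latitude, Longitude, 4326)
WHERE Latitude IS NOT NULL AND Longitude IS NOT NULL;
GO

-- A spatial index requires the table to have a clustered primary key.
-- With no USING clause the server applies its default geography tessellation.
CREATE SPATIAL INDEX IX_People_Location
ON dbo.People (Location);
GO

-- Nearest people within a radius of a passed-in point.
DECLARE @origin geography = geography::Point(51.5074, -0.1278, 4326);
DECLARE @radiusMeters float = 50000;

SELECT TOP (50)
       PersonId,
       Location.STDistance(@origin) / 1609.344 AS DistanceMiles
FROM dbo.People
WHERE Location.STDistance(@origin) <= @radiusMeters
ORDER BY Location.STDistance(@origin);
```

The distance predicate in the WHERE clause is what gives the optimizer a chance to use the spatial index; sorting every row by distance without any such filter still means computing the distance for every row.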

Bonus: To get you started with geography datatypes I suggest you read this blog post with lots of examples: Demystifying Spatial Support in SQL Server 2008.

OTHER TIPS

Your question "would I experience a significant increase in speed ... [by] changing things over to using the geometry data type?" seemed to disregard the possibility that using the dedicated spatial datatypes could actually slow things down. Yet, this may actually be the case, for several reasons.

Firstly, remember that the geometry and geography datatypes support not only points, but linestrings and polygons. The additional complexity they support means that they don't necessarily use simplistic point-to-point distance calculation. They also support a greater range of inbuilt functions on those types, so the serialized value of a point is more complex than just a set of lat, long coordinates. This means that a geometry/geography point value might be slower to retrieve and query than the equivalent columns of raw float coordinate data.

The second, and more significant factor relates to the accuracy with which the distance calculation is performed:

1.) If you have projected coordinates (e.g. UTM, National Grid, or State Plane) then coordinate values are measured in linear (x, y) units on a flat plane. Therefore it's easy to calculate the distance between two points using the Pythagorean theorem: Dist(xy) = SQRT( (x2 - x1)^2 + (y2 - y1)^2 ). This is a simple mathematical method and you'll be unlikely to see much performance difference whether you implement this yourself or use the geometry datatype.

2.) If you have geographic coordinates (i.e. Latitude/Longitude) then these are measured in angular units on an ellipsoid. Most commonly, this is the WGS84 ellipsoid as used by GPS systems. In many cases, you can get a good enough approximation of the distance between two points on the ellipsoid by using simple spherical calculations instead, as you do in your stored procedure. However, the shape of the earth more closely resembles a squashed sphere: it is wider across the equator than it is from pole to pole, and your calculation doesn't allow for this flattening. The geography datatype uses ellipsoidal calculations, based on the ellipsoid model of the supplied SRID, which are necessarily more complex, but will result in a more accurate answer.
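
To get a feel for how much the two approaches differ on your own data, something like the following could be run. It is only a sketch: the coordinates are arbitrary, the sphere radius is the one from the question, and SRID 4326 is assumed for the geography points.

```sql
-- Arbitrary example coordinates in degrees.
DECLARE @lat1 float = 51.5074, @lon1 float = -0.1278;
DECLARE @lat2 float = 48.8566, @lon2 float = 2.3522;
DECLARE @sphereRadiusKm float = 6366.707019;  -- radius used in the question

-- Convert degrees to radians.
DECLARE @lat1Rad float = RADIANS(@lat1), @lon1Rad float = RADIANS(@lon1);
DECLARE @lat2Rad float = RADIANS(@lat2), @lon2Rad float = RADIANS(@lon2);

-- Spherical (haversine) distance, same formula as the stored procedure.
DECLARE @sphericalKm float =
    2 * ASIN(SQRT(
            POWER(SIN((@lat1Rad - @lat2Rad) / 2), 2)
          + COS(@lat1Rad) * COS(@lat2Rad)
          * POWER(SIN((@lon1Rad - @lon2Rad) / 2), 2)
        )) * @sphereRadiusKm;

-- Ellipsoidal distance from the geography type (meters for SRID 4326).
DECLARE @p1 geography = geography::Point(@lat1, @lon1, 4326);
DECLARE @p2 geography = geography::Point(@lat2, @lon2, 4326);
DECLARE @ellipsoidalKm float = @p1.STDistance(@p2) / 1000.0;

SELECT @sphericalKm                  AS SphericalKm,
       @ellipsoidalKm                AS EllipsoidalKm,
       @sphericalKm - @ellipsoidalKm AS DifferenceKm;
```

Wrapping both calculations in a loop over a sample of real rows would also give a direct timing comparison, which is the only reliable way to answer the performance question.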

So I'd recommend that if you want to increase precision and functionality of your spatial data then you should move to spatial datatypes, but not for performance reasons.

I am about to start a new spatial project that will be running on SQL Server 2008. The application will take point data in Lat Lng (WGS 84) and will need to manipulate that data to generate lines and polygons and eventually display it on a Mercator map (OSM in EPSG:900913) which is a rectangular system.

We are not going to be receiving data for the entire world (just parts of Europe) so we do not need to worry about the date line. I'm leaning towards the idea of storing everything in a geometry data type in EPSG:900913, otherwise every point, line, and polygon will have to be converted to the display coordinate system every time a map is drawn (and we are drawing a lot of maps).

To be honest I'm new to SQL Server spatial, my experience has been with Oracle. I suppose what I am saying is that the choice of coordinate system or geometry type depends on what you are doing with the data. If you are having to convert a lot of data between coordinate systems (and that is what you are effectively doing in your distance calculation) then I would have thought storing the data in a suitable coordinate system would be faster.

So the question must then be: did you switch to the native distance function that moontear mentioned and, if so, how has Microsoft implemented it? After all, the distance calculation should be far simpler in a rectangular system, or am I confusing myself?
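
On the "simpler in a rectangular system" point, here is a small sketch of what the geometry type does with planar coordinates. The points are invented, and SRID 0 is used purely as a placeholder meaning "no declared reference system"; both instances must share the same SRID for STDistance to return a value.

```sql
-- Invented planar coordinates in arbitrary linear units.
DECLARE @x1 float = 0, @y1 float = 0;
DECLARE @x2 float = 3, @y2 float = 4;

-- geometry::Point takes x, y, SRID.
DECLARE @p1 geometry = geometry::Point(@x1, @y1, 0);
DECLARE @p2 geometry = geometry::Point(@x2, @y2, 0);

-- The geometry distance is the straight-line planar distance,
-- which matches the hand-written Pythagorean calculation.
SELECT @p1.STDistance(@p2) AS GeometryDistance,
       SQRT(POWER(@x2 - @x1, 2) + POWER(@y2 - @y1, 2)) AS ManualDistance;
-- Both columns return 5 for this 3-4-5 triangle.
```

Bear in mind that planar distances in a Mercator projection such as EPSG:900913 are stretched away from the equator, so the result is in projected units rather than true ground distance.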

Licensed under: CC-BY-SA with attribution
