How does Yelp efficiently calculate distance in the database?

https://dba.stackexchange.com/questions/4210

16-10-2019

Question

For example, say I have a table:

Business(BusinessID, Latitude, Longitude)

All three columns are indexed, of course, and the table holds about 1 million records.

Say I want to find the businesses closest to 106,5, for example. How would I do so?

If I do

SELECT *
FROM Business
WHERE (Some formula to compute distance here) < 2000

for example, or if I do

SELECT *
FROM Business
ORDER BY (Some formula to compute distance here)
LIMIT 20

In theory the computer will have to compute the distance for every business, while in practice only those with a latitude and longitude within a certain range need to be computed.

So how can I do what I want in PHP or SQL, for example?

I am grateful for the answers so far. I am using MySQL, and it does not seem to have anything more efficient than the obvious solution. MySQL spatial does not have a distance function either.

Solution

If I understand the question correctly (and I'm not sure I do), you are worried about computing "(Some formula to compute distance here)" for every row in the table each time you do a query?

This can be mitigated to a degree by using the indexes on latitude and longitude so we only have to compute the distance for a 'box' of points containing the circle we actually want:

select * from business
where (latitude>96 and latitude<116) and 
      (longitude>-5 and longitude<15) and 
      (Some formula to compute distance here) < 2000

Here 96, 116, etc. are chosen to match the unit of the value '2000' and the point on the globe you are calculating distances from.
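To make "chosen to match the unit" concrete, here is a minimal sketch of how the box bounds could be derived from a radius in metres. It treats the Earth as a sphere, assumes the search point is away from the poles and the 180° meridian, and uses the haversine formula as a stand-in for "(Some formula to compute distance here)". The function names are mine, not from any library.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres (spherical approximation)


def bounding_box(lat, lon, radius_m):
    """Return (min_lat, max_lat, min_lon, max_lon) in degrees for a box
    that encloses the circle of radius_m around (lat, lon).

    Assumption: the circle does not touch a pole or cross the 180° meridian.
    """
    r = radius_m / EARTH_RADIUS_M  # angular radius in radians
    dlat = math.degrees(r)
    # Meridians converge away from the equator, so the longitude span
    # has to grow with latitude to cover the same ground distance.
    dlon = math.degrees(math.asin(math.sin(r) / math.cos(math.radians(lat))))
    return lat - dlat, lat + dlat, lon - dlon, lon + dlon


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))
```

If "106,5" in the question is read as longitude 106, latitude 5 (an assumption on my part), then bounding_box(5.0, 106.0, 2000.0) gives the four numbers to plug into the latitude and longitude comparisons, and haversine_m does the final check against 2000.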

How precisely this uses indexes will depend on your RDBMS and the choices its planner makes.
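As a rough end-to-end illustration of the box-then-distance idea, the sketch below runs it against an in-memory SQLite database. SQLite is used only because it ships with Python; it stands in for MySQL or any other engine, and the planner behaviour will differ. The index name, the sample rows, and the haversine_m helper (registered as a SQL function) are all made up for the example.

```python
import math
import sqlite3

EARTH_RADIUS_M = 6_371_000.0  # spherical approximation


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin((phi2 - phi1) / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def nearby(conn, lat, lon, radius_m, limit=20):
    """Return [(BusinessID, distance_m), ...] within radius_m, nearest first."""
    r = radius_m / EARTH_RADIUS_M
    dlat = math.degrees(r)
    dlon = math.degrees(math.asin(math.sin(r) / math.cos(math.radians(lat))))
    sql = """
        SELECT BusinessID, dist FROM (
            SELECT BusinessID,
                   haversine_m(?, ?, Latitude, Longitude) AS dist
            FROM Business
            -- cheap, index-friendly box filter first
            WHERE Latitude  BETWEEN ? AND ?
              AND Longitude BETWEEN ? AND ?
        )
        WHERE dist <= ?      -- exact check only on rows inside the box
        ORDER BY dist
        LIMIT ?
    """
    params = (lat, lon, lat - dlat, lat + dlat, lon - dlon, lon + dlon, radius_m, limit)
    return conn.execute(sql, params).fetchall()


conn = sqlite3.connect(":memory:")
conn.create_function("haversine_m", 4, haversine_m)
conn.execute("CREATE TABLE Business (BusinessID INTEGER PRIMARY KEY, Latitude REAL, Longitude REAL)")
conn.execute("CREATE INDEX idx_business_lat_lon ON Business (Latitude, Longitude)")
conn.executemany(
    "INSERT INTO Business VALUES (?, ?, ?)",
    [
        (1, 5.000, 106.000),  # the search point itself
        (2, 5.010, 106.000),  # roughly 1.1 km north
        (3, 5.015, 106.015),  # inside the box, but outside the 2 km circle
        (4, 6.000, 107.000),  # far away, rejected by the box alone
    ],
)
result = nearby(conn, 5.0, 106.0, 2000.0)
```

Row 3 is the interesting case: it passes the box test but fails the exact distance test, which is why the formula is still needed after the box filter. The same query shape also covers the "closest 20" case, since the distance is only computed and sorted for rows inside the box.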

In general terms, this is a primitive way of optimising a kind of nearest-neighbour search. If your RDBMS supports GiST indexes, as PostgreSQL does, you should consider using them instead.

Other tips

(Disclosure: I'm a Microsoft SQL Server guy, so my answers are influenced by that.)

To really do it efficiently, there are two things you want: caching and native spatial data support. Spatial data support lets you store geography and geometry data directly in the database without doing intensive/expensive calculations on the fly, and lets you build indexes to very rapidly find the closest point to your current location (or most efficient route or whatever).

Caching is important if you want to scale, period. The fastest query is the one you never make. Whenever a user asks for the closest things to him, you store his location and the result set in a cache like Redis or memcached for a period of hours. Business locations aren't going to change for 4 hours - well, they might if someone edits a business, but you don't necessarily need that to be immediately updated in all result sets.
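A minimal sketch of that caching idea, using a plain in-process dictionary as a stand-in for Redis or memcached. Rounding the coordinates to a grid so that nearby users share a cache entry is my own assumption, as are the names and the grid size; only the four-hour lifetime comes from the text above.

```python
import time

CACHE_TTL_S = 4 * 60 * 60  # four hours, as suggested above
GRID_DECIMALS = 3          # hypothetical grid: 0.001 degrees, roughly 110 m of latitude

_cache = {}  # (lat, lon) grid cell -> (time stored, result set)


def cache_key(lat, lon):
    """Snap a location to a grid cell so nearby requests share one entry."""
    return (round(lat, GRID_DECIMALS), round(lon, GRID_DECIMALS))


def cached_nearby(lat, lon, compute, now=time.monotonic):
    """Return the cached result for this grid cell, or call compute(lat, lon)
    with the snapped coordinates and remember the answer for CACHE_TTL_S seconds."""
    key = cache_key(lat, lon)
    t = now()
    entry = _cache.get(key)
    if entry is not None and t - entry[0] < CACHE_TTL_S:
        return entry[1]          # fresh enough: no database query at all
    result = compute(*key)       # the expensive spatial query goes here
    _cache[key] = (t, result)
    return result
```

Here compute would be whatever function runs the real query. With a shared cache such as Redis, the same key could be used with an expiry set on the entry instead of the manual timestamp check.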

Yelp likely uses GIS

PostgreSQL has the reference implementation for GIS with PostGIS. Yelp may be using MySQL, which is inferior in every way. In the case of something like Yelp, they almost certainly keep the coordinates for:

The user
The potential destinations

Those coordinates are almost certainly in WGS84 and stored as the geography type. In PostgreSQL with PostGIS, it would look something like this:

CREATE TABLE businesses (
  id   int               GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
  name text,
  geog geography(point)
);
CREATE INDEX ON businesses USING gist(geog);
-- ... fill table
ANALYZE businesses;

They would fill that table. Then they grab the WGS84 coordinates from your phone and generate a query like this with SQLAlchemy (in the case of Yelp):

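SELECT *
FROM businesses AS b
-- ST_DWithin needs a distance argument; 2000 (metres) is an assumed value taken from the question.
WHERE ST_DWithin(b.geog, ST_MakePoint(userLong, userLat)::geography, 2000);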

For more information see our spatial tag, and check out Geographic Information Systems @ StackExchange.

Licensed under: CC-BY-SA with attribution

Not affiliated with dba.stackexchange