Optimized user database search based on distance

https://stackoverflow.com/questions/11252534

18-06-2021

Question

Sorry if this was already answered (I'm sure someone will throw links at me if it has). I thought I saw a similar question a while back, but I can't find it now.

So, on to the question: I'm building a user search for a site I'm developing, and one of the search criteria will be distance from the searching user. I already have a table of US zip codes and their corresponding lat/long. I've also figured out how to compute the bounding box (min/max lat and min/max long) that determines which zips fit the criteria. (We aren't going to worry about a precise radius; a geographic square will suffice for the time being.) My question: how should I structure the query to optimize speed? Should I:

Perform the required maths to determine the bounding box, then query the zip table to find all zip codes that are potential candidates, followed by a search for users with any of those zip codes?

Determine the lat/long bounding box, join the zip table with the user table, and return the users whose lat/longs fall within those bounds?

I imagine the second method will be faster, but I have no supporting evidence/specific experience which suggests that it will. I know enough SQL to get around, but I'm still kind of new to it and have no clue when it comes down to the relative performance of different types of operations.

Thanks for your time!

Solution

I believe your final query should look like this:

-- compute @minLat, @maxLat, @minLon, @maxLon

SELECT users.*
FROM users
JOIN locations ON locations.id = users.location
WHERE locations.latitude BETWEEN @minLat AND @maxLat
AND locations.longitude BETWEEN @minLon AND @maxLon

In this particular case I do not see the concern, since everything happens in a single query. The query optimizer usually knows better than any human being which side of the JOIN to process first.
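As a minimal sketch of the whole flow, the Python snippet below computes the bounding box and runs the query above against an in-memory SQLite database. The `bounding_box` and `users_in_box` helpers, the sample rows, the indexes, and the 69-miles-per-degree approximation are all assumptions made for illustration; only the table and column names come from the query above.

```python
import math
import sqlite3

MILES_PER_DEGREE_LAT = 69.0  # rough approximation, assumed for illustration


def bounding_box(lat, lon, radius_miles):
    """Return (min_lat, max_lat, min_lon, max_lon) for a square around a point.

    Hypothetical helper: it ignores the poles and the 180th meridian.
    """
    dlat = radius_miles / MILES_PER_DEGREE_LAT
    # A degree of longitude shrinks with the cosine of the latitude.
    dlon = radius_miles / (MILES_PER_DEGREE_LAT * math.cos(math.radians(lat)))
    return lat - dlat, lat + dlat, lon - dlon, lon + dlon


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE locations (id INTEGER PRIMARY KEY, zip TEXT,
                            latitude REAL, longitude REAL);
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT,
                        location INTEGER REFERENCES locations(id));
    -- Assumed indexes: they give the optimizer a cheap way to filter
    -- locations and then look up the matching users.
    CREATE INDEX idx_locations_lat_lon ON locations (latitude, longitude);
    CREATE INDEX idx_users_location ON users (location);
    -- Sample rows with approximate coordinates.
    INSERT INTO locations VALUES
        (1, '10001', 40.7506, -73.9972),
        (2, '07030', 40.7440, -74.0324),
        (3, '90210', 34.0901, -118.4065);
    INSERT INTO users VALUES (1, 'alice', 1), (2, 'bob', 2), (3, 'carol', 3);
""")


def users_in_box(conn, lat, lon, radius_miles):
    """Run the single JOIN query with the computed bounds as parameters."""
    bounds = bounding_box(lat, lon, radius_miles)
    rows = conn.execute(
        """
        SELECT users.name
        FROM users
        JOIN locations ON locations.id = users.location
        WHERE locations.latitude BETWEEN ? AND ?
          AND locations.longitude BETWEEN ? AND ?
        ORDER BY users.name
        """,
        bounds,
    ).fetchall()
    return [name for (name,) in rows]


nearby = users_in_box(conn, 40.7506, -73.9972, 10)
print(nearby)
```

With the sample rows above, a 10-mile box around ZIP 10001 contains the users attached to 10001 and 07030 but not the one in 90210.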

In case you want to implement a more complex computation to determine whether a ZIP code falls within range, then I would prefer to first establish a list of ZIP codes, then match users living in these areas.

This assumes that computing whether a ZIP code is within search range is the most costly part of the operation. I would therefore prefer to run this calculation on the smallest possible data set (i.e. ZIP codes only, instead of ZIP codes joined with users). Even in this case, the query optimizer might be able to make the right choice for you.
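A sketch of that two-step variant follows. It assumes the "more complex computation" is a haversine great-circle distance, which the answer does not specify, and it evaluates that distance in Python over the locations table only before matching users. The helper names and sample data are hypothetical.

```python
import math
import sqlite3

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in miles."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE locations (id INTEGER PRIMARY KEY, zip TEXT,
                            latitude REAL, longitude REAL);
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT,
                        location INTEGER REFERENCES locations(id));
    CREATE INDEX idx_users_location ON users (location);
    -- Sample rows with approximate coordinates.
    INSERT INTO locations VALUES
        (1, '10001', 40.7506, -73.9972),
        (2, '07030', 40.7440, -74.0324),
        (3, '90210', 34.0901, -118.4065);
    INSERT INTO users VALUES (1, 'alice', 1), (2, 'bob', 2), (3, 'carol', 3);
""")


def users_within_radius(conn, lat, lon, radius_miles):
    # Step 1: run the costly distance test on the locations table only.
    candidates = conn.execute("SELECT id, latitude, longitude FROM locations")
    location_ids = [
        loc_id
        for loc_id, loc_lat, loc_lon in candidates
        if haversine_miles(lat, lon, loc_lat, loc_lon) <= radius_miles
    ]
    if not location_ids:
        return []
    # Step 2: match users against the short list of qualifying locations.
    placeholders = ", ".join("?" for _ in location_ids)
    rows = conn.execute(
        f"SELECT name FROM users WHERE location IN ({placeholders}) ORDER BY name",
        location_ids,
    ).fetchall()
    return [name for (name,) in rows]


print(users_within_radius(conn, 40.7506, -73.9972, 5))
```

In practice a bounding-box filter in the first query would likely trim the candidate list before the precise distance test, so the expensive function runs on even fewer rows.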

Other tips

The two algorithms that you describe could be schematically described like this:

A INNER JOIN B WHERE A satisfies condition

and

(A WHERE A satisfies condition) INNER JOIN B

The former is simply a join (the condition could be a join condition or a WHERE condition, but that does not really matter with an INNER JOIN and MySQL).

The latter involves a subquery. Your description seems to assume that the subquery is computed first, followed by the join, but that is generally not the case. The inner join is evaluated first and the subquery second, which may well give you the same execution plan as in the first case.

So these two approaches do not appear to be different from the perspective of performance, and you should focus on choosing one that will be easiest for you to read and maintain, and, when the day comes, profile and optimize it.
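To check this on real data, both forms can be run side by side and their plans inspected. The sketch below uses SQLite as a stand-in (the answer refers to MySQL, where `EXPLAIN` plays the same role); the schema, sample rows, and helper names are assumptions, and the plan text varies by engine and version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE locations (id INTEGER PRIMARY KEY, zip TEXT,
                            latitude REAL, longitude REAL);
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT,
                        location INTEGER REFERENCES locations(id));
    CREATE INDEX idx_locations_lat_lon ON locations (latitude, longitude);
    CREATE INDEX idx_users_location ON users (location);
    -- Sample rows with approximate coordinates.
    INSERT INTO locations VALUES
        (1, '10001', 40.7506, -73.9972),
        (2, '07030', 40.7440, -74.0324),
        (3, '90210', 34.0901, -118.4065);
    INSERT INTO users VALUES (1, 'alice', 1), (2, 'bob', 2), (3, 'carol', 3);
""")

# Hypothetical bounds: (min_lat, max_lat, min_lon, max_lon).
BOX = (40.5, 41.0, -74.5, -73.5)

# Form 1: A INNER JOIN B WHERE A satisfies condition
JOIN_THEN_FILTER = """
    SELECT users.name
    FROM locations
    INNER JOIN users ON users.location = locations.id
    WHERE locations.latitude BETWEEN ? AND ?
      AND locations.longitude BETWEEN ? AND ?
    ORDER BY users.name
"""

# Form 2: (A WHERE A satisfies condition) INNER JOIN B
FILTER_THEN_JOIN = """
    SELECT users.name
    FROM (
        SELECT id FROM locations
        WHERE latitude BETWEEN ? AND ?
          AND longitude BETWEEN ? AND ?
    ) AS nearby
    INNER JOIN users ON users.location = nearby.id
    ORDER BY users.name
"""


def run(sql):
    return [name for (name,) in conn.execute(sql, BOX)]


def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row describes one step.
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql, BOX)]


join_result = run(JOIN_THEN_FILTER)
subquery_result = run(FILTER_THEN_JOIN)
print("same rows:", join_result == subquery_result)

for label, sql in (("join", JOIN_THEN_FILTER), ("subquery", FILTER_THEN_JOIN)):
    for step in plan(sql):
        print(label, "->", step)
```

If the two plans come out identical on your engine, the choice is purely about readability; if they differ, timing both on production-sized data is the only reliable tiebreaker.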

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow