Question

I've shortened the tables to show only the relevant columns for this query. It requires two tables, and the queries are already taking a long time, and we haven't even grown into the 4+ million queries, a log file with 30+ million records, or a user table with 1+ million records. It has me rethinking this... I need some guidance and suggestions:

Here's the table:

-- an abbreviated users table
CREATE TABLE IF NOT EXISTS `users` (
  `userid` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `type` tinyint(1) NOT NULL COMMENT '1=biz, 2=apt, 3=condo, 4=home',
  `distance` decimal(12,7) NOT NULL DEFAULT '1.0000000' COMMENT 'distance away to recv stuff',
  `lat` decimal(12,7) NOT NULL,
  `lon` decimal(12,7) NOT NULL,
  `location` point NOT NULL COMMENT 'GeomFromText',
  UNIQUE KEY `userid` (`userid`),
  KEY `distance` (`distance`),
  KEY `lat` (`lat`),
  KEY `lon` (`lon`),
  SPATIAL KEY `location` (`location`),
  KEY `idx_user_type` (`type`)
) ENGINE=MyISAM  DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci AUTO_INCREMENT=501 ;

Here's the log table.

-- pretty much the full log table
CREATE TABLE IF NOT EXISTS `some_log` (
  `id` int(11) unsigned NOT NULL AUTO_INCREMENT COMMENT 'record num',
  `userid` int(11) unsigned NOT NULL COMMENT 'user id receiving alert',
  `trackid` bigint(20) unsigned NOT NULL COMMENT 'id of msg from message table',
  `sent` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT 'when msg created',
  PRIMARY KEY (`id`),
  KEY `idx_msg_log_userid` (`userid`),
  KEY `idx_msg_log_trackid` (`trackid`)
) ENGINE=MyISAM  DEFAULT CHARSET=ascii COMMENT='log of all of some stuff' AUTO_INCREMENT=62232;

Some sample data for the log file

INSERT INTO `some_log` (`id`, `userid`, `trackid`, `sent`) VALUES
(1, 1, 4, '2011-07-14 18:14:25'),
(2, 2, 4, '2011-07-14 18:14:25'),
(3, 13, 6, '2011-07-25 23:05:54'),
(4, 44, 7, '2011-08-09 16:20:02'),
(5, 12, 17, '2011-08-16 07:35:01'),
(6, 43, 17, '2011-08-16 07:35:01'),
(7, 45, 17, '2011-08-16 07:35:01'),
(8, 12, 18, '2011-08-16 08:05:01'),
(9, 43, 18, '2011-08-16 08:05:01'),
(10, 45, 18, '2011-08-16 08:05:01');

Here's the query.

-- the query; $distance can range from 1/10 mile to 5 miles
SELECT *,
       (((ACOS(SIN($lat * PI()/180) * SIN(`lat` * PI()/180)
             + COS($lat * PI()/180) * COS(`lat` * PI()/180)
                                    * COS(($lon - `lon`) * PI()/180)))
         * 180/PI()) * 60 * 1.1515) AS dist_x
   FROM `users`
   WHERE userid NOT IN (
      SELECT userid
      FROM some_log AS L
      WHERE L.trackid = '$trackid')
   HAVING dist_x <= '$distance' AND dist_x <= `distance`
   ORDER BY dist_x ASC

Here's another query. This one is slow.

-- the above query is pretty quick given the test data
-- this query is dog crap slow...
-- we added in type, and 4 is the most common type of user
SELECT *,
       (((ACOS(SIN($lat * PI()/180) * SIN(`lat` * PI()/180)
             + COS($lat * PI()/180) * COS(`lat` * PI()/180)
                                    * COS(($lon - `lon`) * PI()/180)))
         * 180/PI()) * 60 * 1.1515) AS dist_x
   FROM `users`
   WHERE type = '4' AND userid NOT IN (
      SELECT userid
      FROM some_log AS L
      WHERE L.trackid = '$trackid')
   HAVING dist_x <= '$distance' AND dist_x <= `distance`
   ORDER BY dist_x ASC

One question would be: Is there a radius/circle search that uses GeomFromText/POINT field vs the lat/lon search?

Another question: Is there a better way to check the some_log table for an entry where this $userid already has a $trackid?


Solution

Forget about the spatial index and spatial column in the table. They won't help you with lat-lon computations.

You can use your lat index to exclude whole bunches of pairs of points from the haversine computation. Take advantage of this fact: there are approximately 69 statute miles, 60 nautical miles, or 111.045 km per degree of latitude. (That's not exact, but it's pretty close.)

So you can add a couple of conditions to your query. These will add a range-scan on your lat index, which is a lot faster than just your HAVING condition.

WHERE ...
  AND `lat` >= $lat - ($distance / 69.0)
  AND `lat` <= $lat + ($distance / 69.0)
  ...

This will exclude all the points that are too far north or too far south to be included in your haversine distance calculation, which will save a lot of time.

You can also do this for lon, but the relationship between longitude and distance varies based on latitude. Longitude lines get closer together the closer you are to the poles. Therefore the formula is trickier.
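Putting the two bounds together, the combined pre-filter might look like the sketch below. This is untested against your schema; the 1/cos(latitude) scaling of the longitude window is the standard approximation (it degrades near the poles), and the variable names are borrowed from your query.

-- sketch: bounding box on lat AND lon before the haversine HAVING filter
-- 69.0 miles per degree of latitude; the longitude window widens by
-- a factor of 1/cos(latitude) away from the equator
SELECT *,
       (((ACOS(SIN($lat * PI()/180) * SIN(`lat` * PI()/180)
             + COS($lat * PI()/180) * COS(`lat` * PI()/180)
                                    * COS(($lon - `lon`) * PI()/180)))
         * 180/PI()) * 60 * 1.1515) AS dist_x
   FROM `users`
   WHERE `lat` BETWEEN $lat - ($distance / 69.0)
                   AND $lat + ($distance / 69.0)
     AND `lon` BETWEEN $lon - ($distance / (69.0 * COS($lat * PI()/180)))
                   AND $lon + ($distance / (69.0 * COS($lat * PI()/180)))
     AND userid NOT IN (
        SELECT userid
        FROM some_log AS L
        WHERE L.trackid = '$trackid')
   HAVING dist_x <= '$distance' AND dist_x <= `distance`
   ORDER BY dist_x ASC

With both BETWEEN conditions in place, MySQL can pick a range scan on either the lat or lon index and only run the expensive trigonometry on the rows inside the box.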

Finally, float is a perfectly good data type for lat and lon. You don't need high-precision decimal data for this application, unless you're a civil engineer who cares that the earth's true shape is a geoid, not a sphere. If you care about that, you'd also need a more precise distance formula than the haversine. But we're talking about differences of centimeters here -- big puddles in parking lots, but no problem for store finders.
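If you do make that switch, the conversion could be as simple as the sketch below. It's untested against your data, and note that MyISAM rebuilds the whole table for an ALTER, so expect it to take a while once the table is large.

-- sketch: convert lat/lon from DECIMAL(12,7) to FLOAT
ALTER TABLE `users`
  MODIFY `lat` FLOAT NOT NULL,
  MODIFY `lon` FLOAT NOT NULL;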

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow