Question

I need suggestions because I'm not experienced enough to solve this on my own.

I have a PostgreSQL database that runs on AWS (Amazon Web Services). I have a table "user_location" where the location of each user is stored, and it now contains more than 3 million rows.

I have a script that frequently runs the following query to check whether two users are near each other:

SELECT
    UL.id                          AS id, 
    UL.user_id                     AS user_id, 
    ST_X(UL.location::geometry)    AS lat, 
    ST_Y(UL.location::geometry)    AS lng, 
    UL.datetime                    AS datetime
FROM 
    public.user_location AS UL
WHERE 
    UL.user_id <> 1234567890 AND 
    '1890-10-31 03:00:00 +00:00' - UL.datetime <= interval '1' minute AND
    '1890-10-31 03:00:00 +00:00' >= UL.datetime AND
    ST_DWithin(UL.location, ST_GeogFromText('POINT(54 -1)'), 5000)
ORDER BY
    UL.datetime DESC;

The problem seems to be the radius: the execution time of the query grows exponentially as the radius increases, because more rows have to be checked.

I need a scalable solution where the execution time stays almost the same when the radius around a given location grows. I would like to "cut the data horizontally" by filtering on the datetime first and only then on the radius in the query. How can I do that?

I also have the output of EXPLAIN ANALYZE:

"Sort  (cost=389.72..389.73 rows=3 width=52) (actual time=136848.985..136848.985 rows=0 loops=1)"
"  Sort Key: datetime"
"  Sort Method: quicksort  Memory: 25kB"
"  ->  Bitmap Heap Scan on user_location ul  (cost=11.00..389.70 rows=3 width=52) (actual time=136848.976..136848.976 rows=0 loops=1)"
"        Recheck Cond: (location && '0101000020E6100000C182458F29494B4095E0C3DB39E3F3BF'::geography)"
"        Filter: ((user_id <> 1234567890) AND ('1890-10-31 03:00:00 +00:00'::timestamp with time zone >= datetime) AND (('1890-10-31 03:00:00 +00:00'::timestamp with time zone - datetime) <= '00:01:00'::interval minute) AND ('0101000020E6100000C182458F29494B4095E0C3DB39E3F3BF'::geography && _st_expand(location, 5000::double precision)) AND _st_dwithin(location, '0101000020E6100000C182458F29494B4095E0C3DB39E3F3BF'::geography, 5000::double precision, true))"
"        ->  Bitmap Index Scan on users_locations_gix  (cost=0.00..11.00 rows=91 width=0) (actual time=4463.249..4463.249 rows=165622 loops=1)"
"              Index Cond: (location && '0101000020E6100000C182458F29494B4095E0C3DB39E3F3BF'::geography)"
"Total runtime: 136849.591 ms"

Thanks in advance! Cheers


Solution

With 3 million rows you are going to want to cut down on the number of rows the query actually needs to evaluate. To do this well it would help to know what your data looks like, but there are some fairly simple things to look at first.

How many entries do you expect within the minute you are specifying? I would guess the number should be low. If it is, you could put an index (the default btree one is fine) on UL.datetime (don't forget to VACUUM and ANALYZE afterwards). Then change your query so that it makes good use of it:

 UL.datetime BETWEEN '1890-10-31 03:00:00 +00:00' - interval '1' minute
                 AND '1890-10-31 03:00:00 +00:00' AND
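
For reference, the index and the follow-up maintenance could look like this (the index name here is just an example, not something that already exists in your schema):

CREATE INDEX user_location_datetime_idx     -- plain btree index on the timestamp column
    ON public.user_location (datetime);

VACUUM ANALYZE public.user_location;        -- refresh planner statistics so the new index is considered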

If you have too many rows between those datetimes, we will need to find a way to limit what has to be evaluated by location as well.
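
Putting the two changes together, a sketch of the full query (everything except the datetime predicate is unchanged from your original; the range covers the minute before the reference timestamp, matching your current filter):

SELECT
    UL.id                          AS id,
    UL.user_id                     AS user_id,
    ST_X(UL.location::geometry)    AS lat,
    ST_Y(UL.location::geometry)    AS lng,
    UL.datetime                    AS datetime
FROM
    public.user_location AS UL
WHERE
    UL.user_id <> 1234567890 AND
    -- range predicate on the column itself, so the btree index on datetime can be used
    UL.datetime BETWEEN '1890-10-31 03:00:00 +00:00' - interval '1' minute
                    AND '1890-10-31 03:00:00 +00:00' AND
    ST_DWithin(UL.location, ST_GeogFromText('POINT(54 -1)'), 5000)
ORDER BY
    UL.datetime DESC;

With a narrow time window the planner should be able to use the datetime index to cut the candidate set down before the ST_DWithin check, so increasing the radius has far less impact on the run time.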
