Question

We have a hosted Postgres 11 DB server from which we fetch a medium-sized dataset and ingest it into another application.

"States" table is the largest one in the following query, which will fetch ~1.5 million records to be ingested, and currently has ~1.6 million records.

with filtered_states as (
    select *
    from pepys."States" States_cte
    where
        tsrange('2010-09-01 10:00:30.000000', '2020-12-15 10:00:30.000000', '[]') @> States_cte.time and
        -- Spatial criteria from the UI
        ST_Contains(ST_GeomFromText('SRID=4326;POLYGON((-125.0 65.0,-125.0 -45.0,80.0 -45.0,80.0 65.0,-125.0 65.0))'), States_cte.location)
)
select
    filtered_states.state_id, filtered_states.time, Sensors.name, Platforms.name,
    PlatformTypes.name, Nationalities.name,
    filtered_states.location, filtered_states.elevation, filtered_states.heading,
    filtered_states.course, filtered_states.speed
from filtered_states as filtered_states
inner join pepys."Sensors" as Sensors on filtered_states.sensor_id = Sensors.sensor_id
inner join pepys."Platforms" as Platforms on Sensors.host = Platforms.platform_id
inner join pepys."PlatformTypes" as PlatformTypes on Platforms.platform_type_id = PlatformTypes.platform_type_id
inner join pepys."Nationalities" as Nationalities on Platforms.nationality_id = Nationalities.nationality_id;

The EXPLAIN ANALYZE output for the query is as follows:

Nested Loop  (cost=5745.62..5750.97 rows=3 width=126) (actual time=3352.559..8350.784 rows=1509620 loops=1)
  CTE filtered_states
    ->  Bitmap Heap Scan on "States" states_cte  (cost=48.48..5745.10 rows=3 width=176) (actual time=545.383..2234.724 rows=1509620 loops=1)
          Recheck Cond: ('0103000020E610000001000000050000000000000000405FC000000000004050400000000000405FC000000000008046C0000000000000544000000000008046C0000000000000544000000000004050400000000000405FC00000000000405040'::geometry ~ location)
          Filter: (('["2010-09-01 10:00:30","2020-12-15 10:00:30"]'::tsrange @> "time") AND _st_contains('0103000020E610000001000000050000000000000000405FC000000000004050400000000000405FC000000000008046C0000000000000544000000000008046C0000000000000544000000000004050400000000000405FC00000000000405040'::geometry, location))
          Rows Removed by Filter: 115563
          Heap Blocks: exact=31242
          ->  Bitmap Index Scan on "idx_States_location"  (cost=0.00..48.48 rows=1625 width=0) (actual time=539.851..539.851 rows=1625183 loops=1)
                Index Cond: ('0103000020E610000001000000050000000000000000405FC000000000004050400000000000405FC000000000008046C0000000000000544000000000008046C0000000000000544000000000004050400000000000405FC00000000000405040'::geometry ~ location)
  ->  Nested Loop  (cost=0.37..4.07 rows=3 width=130) (actual time=3352.495..6841.898 rows=1509620 loops=1)
        ->  Nested Loop  (cost=0.24..3.02 rows=3 width=136) (actual time=3352.466..5467.492 rows=1509620 loops=1)
              ->  Hash Join  (cost=0.10..1.90 rows=3 width=112) (actual time=3352.415..3984.403 rows=1509620 loops=1)
                    Hash Cond: (sensors.sensor_id = filtered_states.sensor_id)
                    ->  Seq Scan on "Sensors" sensors  (cost=0.00..1.56 rows=56 width=40) (actual time=0.029..0.042 rows=56 loops=1)
                    ->  Hash  (cost=0.06..0.06 rows=3 width=104) (actual time=3351.739..3351.739 rows=1509620 loops=1)
                          Buckets: 32768 (originally 1024)  Batches: 4 (originally 1)  Memory Usage: 122310kB
                          ->  CTE Scan on filtered_states  (cost=0.00..0.06 rows=3 width=104) (actual time=545.388..2875.620 rows=1509620 loops=1)
              ->  Index Scan using "pk_Platforms" on "Platforms" platforms  (cost=0.14..0.37 rows=1 width=56) (actual time=0.001..0.001 rows=1 loops=1509620)
                    Index Cond: (platform_id = sensors.host)
        ->  Index Scan using "pk_PlatformTypes" on "PlatformTypes" platformtypes  (cost=0.13..0.41 rows=1 width=26) (actual time=0.001..0.001 rows=1 loops=1509620)
              Index Cond: (platform_type_id = platforms.platform_type_id)
  ->  Index Scan using "pk_Nationalities" on "Nationalities" nationalities  (cost=0.15..0.60 rows=1 width=28) (actual time=0.001..0.001 rows=1 loops=1509620)
        Index Cond: (nationality_id = platforms.nationality_id)
Planning Time: 52.165 ms
Execution Time: 8472.479 ms

We tried running the query from a Python script to time it and found that it took 9 minutes to return the cursor object.

import datetime
from timeit import default_timer as timer

# conn (an open database connection) and query (the SQL above) are defined earlier

# create a cursor
cur = conn.cursor()
# execute the statement and time it
print(datetime.datetime.now())
start = timer()
cur.execute(query)
end = timer()
print(datetime.datetime.now())
print(end - start)

The following is the structure of the "States" table:

Column          Data type
state_id        uuid
time            timestamp without time zone
sensor_id       uuid
location        USER-DEFINED
elevation       double precision
heading         double precision
course          double precision
speed           double precision
source_id       uuid
privacy_id      uuid
created_date    timestamp without time zone
remarks         text

The user-defined location type is a PostGIS extension type (geometry).

We need assistance with:

  1. understanding why the execution time reported by EXPLAIN ANALYZE (~8.5 seconds) and the time actually observed from the client (~9 minutes) are so far apart.
  2. making this query return results faster.

Solution

As per the question comments, the lost time could be due to the transfer of the data from the server to the client. Though 1.5 million rows isn't anything crazy, if your result set is wide (a lot of columns), I could still see this being the case.

A better test on DBeaver would be to insert all ~1.5 million rows into a temporary table so the data never actually has to go from server to client.

If it can insert into the temporary table very quickly, then the issue is a result of the data being handed to the client; if it is still slow, then you have other issues going on.
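
As a rough sketch of that test: the statement below can be pasted straight into DBeaver's SQL editor, or run from the same Python script shown in the question (it reuses that script's conn and query variables); the table name filtered_states_tmp is just a placeholder.

from timeit import default_timer as timer

with conn.cursor() as cur:
    start = timer()
    # CREATE TEMP TABLE ... AS runs the full query but keeps the result
    # on the server, so no rows cross the network to the client.
    cur.execute("CREATE TEMP TABLE filtered_states_tmp AS " + query)
    print("server-side insert took %.1f seconds" % (timer() - start))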

As far as improving this bottleneck when you do need to pull the data to the client, it will greatly depend on how the data is returned. I'm unfortunately not an expert on Python, and Stack Overflow might yield better answers on how to pull a lot of data from a database in a performant manner. I know the term cursor is usually associated with bad performance on the database side, but I'm not sure what it means in the context of Python (since I don't do much in Python).
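
For what it's worth, if the driver turns out to be psycopg2, its documentation describes server-side ("named") cursors, which stream rows in batches instead of loading the entire result set into client memory when execute() runs; the default client-side behaviour would be consistent with cur.execute() itself taking minutes. A minimal sketch, assuming psycopg2, the conn and query variables from the question, and a hypothetical ingest() standing in for the downstream processing:

# A named cursor is a server-side cursor: execute() returns quickly and
# rows are then fetched in batches of itersize per network round trip.
with conn.cursor(name="states_stream") as cur:
    cur.itersize = 10000        # rows per fetch; psycopg2's default is 2000
    cur.execute(query)          # no rows are transferred yet
    for row in cur:             # rows arrive in itersize-sized chunks
        ingest(row)             # hypothetical downstream ingestion step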

From a database perspective, if you can filter the data down further on the database side before it's returned to the client, that would ultimately be your best bet. Otherwise, you can design around paging the data, bringing back smaller subsets of the total records needed at a time.
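
Below is a sketch of what keyset-style paging could look like here, keyed on (time, state_id) so each page resumes exactly where the previous one left off. The joins from the original query are trimmed for brevity, and PAGE_SIZE, the sentinel start values, and ingest() are assumptions for illustration.

import datetime
import uuid
import psycopg2.extras

psycopg2.extras.register_uuid()   # lets psycopg2 pass uuid.UUID parameters

PAGE_SQL = """
    SELECT s.state_id, s.time, s.location, s.elevation,
           s.heading, s.course, s.speed
    FROM pepys."States" s
    WHERE tsrange('2010-09-01 10:00:30', '2020-12-15 10:00:30', '[]') @> s.time
      AND ST_Contains(ST_GeomFromText(
            'SRID=4326;POLYGON((-125.0 65.0,-125.0 -45.0,80.0 -45.0,80.0 65.0,-125.0 65.0))'),
            s.location)
      AND (s.time, s.state_id) > (%s, %s)   -- resume after the previous page's last row
    ORDER BY s.time, s.state_id
    LIMIT %s
"""

PAGE_SIZE = 50000
last_time, last_id = datetime.datetime.min, uuid.UUID(int=0)  # "before everything" sentinels

with conn.cursor() as cur:
    while True:
        cur.execute(PAGE_SQL, (last_time, last_id, PAGE_SIZE))
        rows = cur.fetchall()
        if not rows:
            break
        ingest(rows)                                   # hypothetical downstream ingestion step
        last_time, last_id = rows[-1][1], rows[-1][0]  # keys of the last row on this page

For paging like this to stay fast you would generally want a btree index on ("time", state_id); whether that wins against the spatial filter here is something to verify with EXPLAIN on your own data.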

Licensed under: CC-BY-SA with attribution