I would like to optimize (read: make feasible at all) an SQL query.
The following PostgreSQL query retrieves the records I need. I (believe that I) could confirm that by running the query on a small subset of the actual DB.
SELECT B.*, A1.foo, A1.bar, A2.foo, A2.bar FROM B LEFT JOIN A as A1 on B.n1_id = A1.n_id LEFT JOIN A as A2 on B.n2_id = A2.n_id WHERE B.l_id IN (
SELECT l_id FROM C
WHERE l_id IN (
SELECT l_id FROM B
WHERE n1_id IN (SELECT n_id FROM A WHERE foo BETWEEN foo_min AND foo_max AND bar BETWEEN bar_min AND bar_max)
UNION
SELECT l_id FROM B
WHERE n2_id IN (SELECT n_id FROM A WHERE foo BETWEEN foo_min AND foo_max AND bar BETWEEN bar_min AND bar_max)
)
AND (property1 = 'Y' OR property2 = 'Y')
)
The relevant part of the DB looks as follows:
table A:
n_id (PK);
foo, int (indexed);
bar, int (indexed);
table B:
l_id (PK);
n1_id (FK, indexed);
n2_id (FK, (indexed);
table C:
l_id (PK, FK);
property1, char (indexed);
property2, char (indexed);
EXPLAIN
tells me this:
"Merge Join (cost=6590667.27..10067376.97 rows=453419 width=136)"
" Merge Cond: (A2.n_id = B.n2_id)"
" -> Index Scan using pk_A on A A2 (cost=0.57..3220265.29 rows=99883648 width=38)"
" -> Sort (cost=6590613.72..6591747.27 rows=453419 width=98)"
" Sort Key: B.n2_id"
" -> Merge Join (cost=3071304.25..6548013.91 rows=453419 width=98)"
" Merge Cond: (A1.n_id = B.n1_id)"
" -> Index Scan using pk_A on A A1 (cost=0.57..3220265.29 rows=99883648 width=38)"
" -> Sort (cost=3071250.74..3072384.28 rows=453419 width=60)"
" Sort Key: B.n1_id"
" -> Hash Semi Join (cost=32475.31..3028650.92 rows=453419 width=60)"
" Hash Cond: (B.l_id = C.l_id)"
" -> Seq Scan on B B (cost=0.00..2575104.04 rows=122360504 width=60)"
" -> Hash (cost=26807.58..26807.58 rows=453419 width=16)"
" -> Nested Loop (cost=10617.22..26807.58 rows=453419 width=16)"
" -> HashAggregate (cost=10616.65..10635.46 rows=1881 width=8)"
" -> Append (cost=4081.76..10611.95 rows=1881 width=8)"
" -> Nested Loop (cost=4081.76..5383.92 rows=1078 width=8)"
" -> Bitmap Heap Scan on A (cost=4081.19..4304.85 rows=56 width=8)"
" Recheck Cond: ((bar >= bar_min) AND (bar <= bar_max) AND (foo >= foo_min) AND (foo <= foo_max))"
" -> BitmapAnd (cost=4081.19..4081.19 rows=56 width=0)"
" -> Bitmap Index Scan on A_bar_idx (cost=0.00..740.99 rows=35242 width=0)"
" Index Cond: ((bar >= bar_min) AND (bar <= bar_max))"
" -> Bitmap Index Scan on A_foo_idx (cost=0.00..3339.93 rows=159136 width=0)"
" Index Cond: ((foo >= foo_min) AND (foo <= foo_max))"
" -> Index Scan using nx_B_n1 on B (cost=0.57..19.08 rows=19 width=16)"
" Index Cond: (n1_id = A.n_id)"
" -> Nested Loop (cost=4081.76..5209.22 rows=803 width=8)"
" -> Bitmap Heap Scan on A A_1 (cost=4081.19..4304.85 rows=56 width=8)"
" Recheck Cond: ((bar >= bar_min) AND (bar <= bar_max) AND (foo >= foo_min) AND (foo <= foo_max))"
" -> BitmapAnd (cost=4081.19..4081.19 rows=56 width=0)"
" -> Bitmap Index Scan on A_bar_idx (cost=0.00..740.99 rows=35242 width=0)"
" Index Cond: ((bar >= bar_min) AND (bar <= bar_max))"
" -> Bitmap Index Scan on A_foo_idx (cost=0.00..3339.93 rows=159136 width=0)"
" Index Cond: ((foo >= foo_min) AND (foo <= foo_max))"
" -> Index Scan using nx_B_n2 on B B_1 (cost=0.57..16.01 rows=14 width=16)"
" Index Cond: (n2_id = A_1.n_id)"
" -> Index Scan using pk_C on C (cost=0.57..8.58 rows=1 width=8)"
" Index Cond: (l_id = B.l_id)"
" Filter: ((property1 = 'Y'::bpchar) OR (property2 = 'Y'::bpchar))"
All three tables have millions of rows. I cannot change the table definitions.
The WHERE l_id IN ( SELECT l_id FROM B...UNION...)
is very restrictive and returns < 100 results.
What can I do to make the query execute in a reasonable amount of time (a few seconds max)?
EDIT: forgot to select two columns in the outermost SELECT
. That should now change this question though.
UPDATE
This seems to be a difficult question, possibly due to lack of information on my part. I would love to give more information, however the data base is properietary and confidential.
I can retrieve the rows of B with all conditions reasonably fast (0.1 s) with the following query:
WITH relevant_a AS (
SELECT * FROM A
WHERE
foo BETWEEN foo_min AND foo_max
AND
bar BETWEEN bar_min AND bar_max
)
WITH relevant_c AS (
SELECT * FROM C
WHERE l_id IN (
SELECT l_id FROM B
WHERE n1_id IN (
SELECT n_id FROM relevant_a
)
UNION
SELECT l_id FROM B
WHERE n2_id IN (
SELECT n_id FROM relevant_a
)
)
AND
(property1 = 'Y' OR property2= 'Y')
),
relevant_b AS (
SELECT * FROM B WHERE l_id IN (
SELECT l_id FROM relevant_c
)
)
SELECT * FROM relevant_b
The join with A is the part where it becomes slow. The query returns < 100 records, so why does the join with A make it so slow?. Do you have any ideas how to make this simple join faster? It should not be that costly to simply add four columns of information from another table.