Question

I would like to optimize (read: make feasible at all) an SQL query.

The following PostgreSQL query retrieves the records I need. I (believe that I) could confirm that by running the query on a small subset of the actual DB.

SELECT B.*, A1.foo, A1.bar, A2.foo, A2.bar FROM B LEFT JOIN A as A1 on B.n1_id = A1.n_id LEFT JOIN A as A2 on B.n2_id = A2.n_id WHERE B.l_id IN (
    SELECT l_id FROM C 
        WHERE l_id IN (
            SELECT l_id FROM B 
                WHERE n1_id IN (SELECT n_id FROM A WHERE foo BETWEEN foo_min AND foo_max AND bar BETWEEN bar_min AND bar_max)
            UNION
            SELECT l_id FROM B 
                WHERE n2_id IN (SELECT n_id FROM A WHERE foo BETWEEN foo_min AND foo_max AND bar BETWEEN bar_min AND bar_max)
            ) 
            AND (property1 = 'Y' OR property2 = 'Y')
    )

The relevant part of the DB looks as follows:

table A:
n_id (PK);
foo, int (indexed);
bar, int (indexed);

table B:
l_id (PK);
n1_id (FK, indexed);
n2_id (FK, (indexed);

table C:
l_id (PK, FK);
property1, char (indexed);
property2, char (indexed);

EXPLAIN tells me this:

"Merge Join  (cost=6590667.27..10067376.97 rows=453419 width=136)"
"  Merge Cond: (A2.n_id = B.n2_id)"
"  ->  Index Scan using pk_A on A A2  (cost=0.57..3220265.29 rows=99883648 width=38)"
"  ->  Sort  (cost=6590613.72..6591747.27 rows=453419 width=98)"
"        Sort Key: B.n2_id"
"        ->  Merge Join  (cost=3071304.25..6548013.91 rows=453419 width=98)"
"              Merge Cond: (A1.n_id = B.n1_id)"
"              ->  Index Scan using pk_A on A A1  (cost=0.57..3220265.29 rows=99883648 width=38)"
"              ->  Sort  (cost=3071250.74..3072384.28 rows=453419 width=60)"
"                    Sort Key: B.n1_id"
"                    ->  Hash Semi Join  (cost=32475.31..3028650.92 rows=453419 width=60)"
"                          Hash Cond: (B.l_id = C.l_id)"
"                          ->  Seq Scan on B B  (cost=0.00..2575104.04 rows=122360504 width=60)"
"                          ->  Hash  (cost=26807.58..26807.58 rows=453419 width=16)"
"                                ->  Nested Loop  (cost=10617.22..26807.58 rows=453419 width=16)"
"                                      ->  HashAggregate  (cost=10616.65..10635.46 rows=1881 width=8)"
"                                            ->  Append  (cost=4081.76..10611.95 rows=1881 width=8)"
"                                                  ->  Nested Loop  (cost=4081.76..5383.92 rows=1078 width=8)"
"                                                        ->  Bitmap Heap Scan on A  (cost=4081.19..4304.85 rows=56 width=8)"
"                                                              Recheck Cond: ((bar >= bar_min) AND (bar <= bar_max) AND (foo >= foo_min) AND (foo <= foo_max))"
"                                                              ->  BitmapAnd  (cost=4081.19..4081.19 rows=56 width=0)"
"                                                                    ->  Bitmap Index Scan on A_bar_idx  (cost=0.00..740.99 rows=35242 width=0)"
"                                                                          Index Cond: ((bar >= bar_min) AND (bar <= bar_max))"
"                                                                    ->  Bitmap Index Scan on A_foo_idx  (cost=0.00..3339.93 rows=159136 width=0)"
"                                                                          Index Cond: ((foo >= foo_min) AND (foo <= foo_max))"
"                                                        ->  Index Scan using nx_B_n1 on B  (cost=0.57..19.08 rows=19 width=16)"
"                                                              Index Cond: (n1_id = A.n_id)"
"                                                  ->  Nested Loop  (cost=4081.76..5209.22 rows=803 width=8)"
"                                                        ->  Bitmap Heap Scan on A A_1  (cost=4081.19..4304.85 rows=56 width=8)"
"                                                              Recheck Cond: ((bar >= bar_min) AND (bar <= bar_max) AND (foo >= foo_min) AND (foo <= foo_max))"
"                                                              ->  BitmapAnd  (cost=4081.19..4081.19 rows=56 width=0)"
"                                                                    ->  Bitmap Index Scan on A_bar_idx  (cost=0.00..740.99 rows=35242 width=0)"
"                                                                          Index Cond: ((bar >= bar_min) AND (bar <= bar_max))"
"                                                                    ->  Bitmap Index Scan on A_foo_idx  (cost=0.00..3339.93 rows=159136 width=0)"
"                                                                          Index Cond: ((foo >= foo_min) AND (foo <= foo_max))"
"                                                        ->  Index Scan using nx_B_n2 on B B_1  (cost=0.57..16.01 rows=14 width=16)"
"                                                              Index Cond: (n2_id = A_1.n_id)"
"                                      ->  Index Scan using pk_C on C  (cost=0.57..8.58 rows=1 width=8)"
"                                            Index Cond: (l_id = B.l_id)"
"                                            Filter: ((property1 = 'Y'::bpchar) OR (property2 = 'Y'::bpchar))"

All three tables have millions of rows. I cannot change the table definitions. The WHERE l_id IN ( SELECT l_id FROM B...UNION...) is very restrictive and returns < 100 results.

What can I do to make the query execute in a reasonable amount of time (a few seconds max)?

EDIT: forgot to select two columns in the outermost SELECT. That should now change this question though.

UPDATE This seems to be a difficult question, possibly due to lack of information on my part. I would love to give more information, however the data base is properietary and confidential.

I can retrieve the rows of B with all conditions reasonably fast (0.1 s) with the following query:

WITH relevant_a AS (
    SELECT * FROM A 
        WHERE
            foo BETWEEN foo_min AND foo_max 
            AND
            bar BETWEEN bar_min AND bar_max
)
WITH relevant_c AS (
    SELECT * FROM C
        WHERE l_id IN (
            SELECT l_id FROM B
                WHERE n1_id IN (
                    SELECT n_id FROM relevant_a
                )
            UNION
            SELECT l_id FROM B
                WHERE n2_id IN (
                    SELECT n_id FROM relevant_a
                )
        )
        AND
        (property1 = 'Y' OR property2= 'Y')
),
relevant_b AS (
    SELECT * FROM B WHERE l_id IN (
        SELECT l_id FROM relevant_c
    )
)

SELECT * FROM relevant_b

The join with A is the part where it becomes slow. The query returns < 100 records, so why does the join with A make it so slow?. Do you have any ideas how to make this simple join faster? It should not be that costly to simply add four columns of information from another table.

Was it helpful?

Solution 5

I could make it work with the following query:

WITH relevant_a AS (
    SELECT * FROM A 
        WHERE
            foo BETWEEN foo_min AND foo_max 
            AND
            bar BETWEEN bar_min AND bar_max
),
relevant_c AS (
    SELECT * FROM C
        WHERE l_id IN (
            SELECT l_id FROM B
                WHERE n1_id IN (
                    SELECT n_id FROM relevant_a
                )
            UNION
            SELECT l_id FROM B
                WHERE n2_id IN (
                    SELECT n_id FROM relevant_a
                )
        )
        AND
        (property1 = 'Y' OR property2= 'Y')
),
relevant_b AS (
    SELECT * FROM B WHERE l_id IN (
        SELECT l_id FROM relevant_c
    )
),
a1_data AS (
    SELECT A.n_id, A.foo, A.bar
    FROM A
    WHERE A.n_id IN (
        SELECT n1_id FROM relevant_b
    )
)
a2_data AS (
    SELECT A.n_id, A.foo, A.bar
    FROM A
    WHERE A.n_id IN (
        SELECT n2_id FROM relevant_b
    )
)

SELECT relevant_b.*, a1_data.foo,  a1_data.bar,  a2_data.foo,  a2_data.bar 
FROM relevant_b
LEFT JOIN a1_data ON relevant_b.n1_id = a1_data.n_id
LEFT JOIN a2_data ON relevant_b.n2_id = a2_data.n_id

I do not like this solution as it seems forced and redundant. However, it gets the job done in < 0.1 s.

I still refuse to believe that an SQL noob like me can (quite easily in retrospect) come up with a statement that forces the optimizer to use a strategy much better than the one it comes up with on its own. There has to be a better solution but I will not be looking for one anymore.

Regardless, thank you all for your suggestions, I definetely learned a few things on the way.

OTHER TIPS

Or something like this:

SELECT B.*, A1.foo, A2.bar 
FROM B 
     LEFT JOIN A as A1 on B.n1_id = A1.n_id 
     LEFT JOIN A as A2 on B.n2_id = A2.n_id 
     INNER JOIN C on (C.l_id = B.l_id)
where 
     A1.foo between A1.foo_min AND A1.foo_max AND 
     A2.bar BETWEEN A2.bar_min AND A2.bar_max and
     b.foo between b.foo_min AND b.foo_max AND 
     b.bar BETWEEN b.bar_min AND bar_max   AND
     (C.property1 = 'Y' OR C.property2 = 'Y')
  1. You could club both the Left joins on A (A1,A2) into 1 Left join and use a OR for the ON Clause
  2. Change INs to Exists
  3. Change the UNION into 1 query using a OR Clause

Try this...

    SELECT B.*
    , A1.foo
    , A2.bar
FROM B
LEFT JOIN A AS A1
    ON (
            B.n1_id = A1.n_id
            OR B.n2_id = A1.n_id
            )
WHERE EXISTS (
        SELECT l_id
        FROM C
        WHERE EXISTS (
                SELECT l_id
                FROM B
                WHERE EXISTS (
                        SELECT n_id
                        FROM A
                        WHERE foo BETWEEN foo_min
                                AND foo_max
                            AND bar BETWEEN bar_min
                                AND bar_max
                            AND (
                                A.n_id = B.n_id
                                OR A.n_id = B.n2_id
                                )
                        )
                    AND B.l_id = C.l_id
                )
        )

There should be only one index for both columns foo and bar in Table A, so this would be optimal for both between clauses (find by your self witch would be the better columns order) Table B is not optimally designed for this query, you should traspose values n1_id and n2_id in 2 different rows with same key and column_id (crosstab).

The follwing query should return then same data with a huge perfomance improvement.

with b_norm(l_id, role, n_id) as (
        select l_id, unnest(Array['1','2']) as role, unnest(Array[n1_id, n2_id]) as n_id
        from b
    )
select *
from (
        select distinct l_id
        from a
            join b_norm using (n_id)
            join c using (l_id)
        where bar between 0 and 10000
            and foo between 10000 and 20000
            and (c.property1 = 'Y' or c.property2 = 'Y')
    ) as driver
    join b using (l_id)
    join (
        select a.n_id as n1_id, foo as foo1, bar as bar1
        from a
    )  as a1 using (n1_id)
    join (
        select a.n_id as n2_id, foo as foo2, bar as bar2
        from a
    ) as a2 using (n2_id);

As far as I see it, you select B where at least one of the two associated As is within the given ranges. Moreover you require there exists a C for the B. Then you show foo and bar of the two associated As with the B values.

SELECT B.*, A1.foo, A2.bar
FROM B 
LEFT JOIN A A1 ON A1.n_id = B.n1_id
LEFT JOIN A A2 ON A2.n_id = B.n2_id
WHERE 
(
  (A1.foo BETWEEN foo_min AND foo_max AND A1.bar BETWEEN bar_min AND bar_max)
  OR
  (A2.foo BETWEEN foo_min AND foo_max AND A2.bar BETWEEN bar_min AND bar_max)
)
AND EXISTS
(
  SELECT *
  FROM C
  WHERE C.l_id = B.l_id
  AND (property1 = 'Y' OR property2 = 'Y')
);

Can B.n1_id and B.n2_id be NULL? Then you need the left outer joins. Otherwise you can replace them with inner joins.

EDIT: Oops I missed the C criteria. I've altered the statement accordingly.

EDIT: Reacting to your comments, here is the same select with INNER JOINs and an IN clause:

SELECT B.*, A1.foo, A2.bar
FROM B 
INNER JOIN A A1 ON A1.n_id = B.n1_id
INNER JOIN A A2 ON A2.n_id = B.n2_id
WHERE 
(
  (A1.foo BETWEEN foo_min AND foo_max AND A1.bar BETWEEN bar_min AND bar_max)
  OR
  (A2.foo BETWEEN foo_min AND foo_max AND A2.bar BETWEEN bar_min AND bar_max)
)
AND B.l_id IN
(
  SELECT l_id
  FROM C
  WHERE property1 = 'Y' OR property2 = 'Y'
);
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top