Indexes not used with OR
-
06-10-2020 - |
Question
I have query such as this:
SELECT t0.id
FROM platform_conversations t0
LEFT OUTER JOIN contacts t1 ON t1.id = t0.contact_id
WHERE t0.user_id = 5340
AND (
t0.participant ILIKE '%baa%' -- (1)
OR t1.first_name ILIKE '%baa%' -- (2)
)
LIMIT 50;
and
CREATE INDEX ix_conversations_participant
ON platform_conversations USING GIN (participant gin_trgm_ops);
CREATE INDEX ix_trgm_contacts_search
ON contacts USING GIN (first_name gin_trgm_ops);
And can't figure out why indexes are not used with OR
condition. If I use only (1), or only (2), or use AND
instead, they are both used.
Here's the plan:
For only (1)
Limit (cost=12.43..203.68 rows=50 width=37) (actual time=0.037..0.037 rows=0 loops=1)
-> Bitmap Heap Scan on platform_conversations t0 (cost=12.43..222.80 rows=55 width=37) (actual time=0.030..0.030 rows=0 loops=1)
Recheck Cond: ((participant)::text ~~* '%baa%'::text)
Filter: (user_id = 5340)
-> Bitmap Index Scan on ix_conversations_participant (cost=0.00..12.42 rows=55 width=0) (actual time=0.012..0.012 rows=0 loops=1)
Index Cond: ((participant)::text ~~* '%baa%'::text)
Planning time: 0.397 ms
Execution time: 0.092 ms
For only (2)
Limit (cost=16.71..446.06 rows=50 width=37) (actual time=0.034..0.034 rows=0 loops=1)
-> Nested Loop (cost=16.71..471.82 rows=53 width=37) (actual time=0.028..0.028 rows=0 loops=1)
-> Bitmap Heap Scan on contacts t1 (cost=16.29..158.99 rows=37 width=37) (actual time=0.022..0.022 rows=0 loops=1)
Recheck Cond: ((first_name)::text ~~* '%baa%'::text)
-> Bitmap Index Scan on ix_trgm_contacts_search (cost=0.00..16.28 rows=37 width=0) (actual time=0.016..0.016 rows=0 loops=1)
Index Cond: ((first_name)::text ~~* '%baa%'::text)
-> Index Scan using ix_platform_conversations_contact_id on platform_conversations t0 (cost=0.42..8.45 rows=1 width=74) (never executed)
Index Cond: ((contact_id)::text = (t1.id)::text)
Filter: (user_id = 5340)
Planning time: 0.840 ms
Execution time: 0.121 ms
Slow one (with OR):
Limit (cost=25771.43..47419.63 rows=50 width=37) (actual time=6652.292..6652.292 rows=0 loops=1)
-> Hash Left Join (cost=25771.43..72531.55 rows=108 width=37) (actual time=6652.282..6652.282 rows=0 loops=1)
Hash Cond: ((t0.contact_id)::text = (t1.id)::text)
Filter: (((t0.participant)::text ~~* '%baa%'::text) OR ((t1.first_name)::text ~~* '%baa%'::text))
Rows Removed by Filter: 553477
-> Seq Scan on platform_conversations t0 (cost=0.00..20627.29 rows=553512 width=87) (actual time=0.045..1741.899 rows=553477 loops=1)
Filter: (user_id = 5340)
Rows Removed by Filter: 386
-> Hash (cost=17578.19..17578.19 rows=384819 width=46) (actual time=2372.508..2372.508 rows=384819 loops=1)
Buckets: 65536 Batches: 16 Memory Usage: 2359kB
-> Seq Scan on contacts t1 (cost=0.00..17578.19 rows=384819 width=46) (actual time=0.014..1168.218 rows=384819 loops=1)
Planning time: 0.741 ms
Execution time: 6652.333 ms
Can't it just do bitmap index scans on t1, t2 and then bitmap ORs on them? Why is OR such a problem?
Postgres 9.6 (can't create tag, not enough rep)
Solution
Can't it just do bitmap index scans on t1, t2 and then bitmap ORs on them? Why is OR such a problem?
The question is: "Do you want filter data AFTER doing join ? OR Do you want filter data BEFORE doing join?"
Because of lacking data when using OR
. That's why PG needs to do a sequence scan
on those tables.
Your condition is t1.id = t2.contract_id, (1) or (2)
.
In your case, supposed that PG will use bitmap
index, i.e. t1
uses (1)
and t2
uses (2)
. Here, you can see if (1)
is false (!), then check (2)
. The problem is if (2)
is true, data of columns of (1)
will be lacked.
(!)
If PG really uses bitmap
index, this case has never occurred (because (1) is always true) .
Here, my simple example (Postgres 9.6) about AFTER-BEFORE join:
create table t1 (a int, b text, c text);
insert into t1 select a, md5(a::text), md5((a+1)::text) from generate_series(1, 1e6)a ;
create table t2 as select * from t1;
CREATE INDEX idx1 ON t1 USING GIN (b gin_trgm_ops);
CREATE INDEX idx2 ON t2 USING GIN (c gin_trgm_ops);
explain analyze select * from t1 left outer join t2 on t1.a = t2.a where t1.a > 10 and t1.b like '%abcd%';
explain analyze select * from t1 left outer join t2 on t1.a = t2.a where t1.a > 10 and t2.c like '%abcd%';
explain analyze select * from t1 left outer join t2 on t1.a = t2.a where t1.a > 10 and ( t1.b like '%abcd%' or t2.c like '%abcd%') ;
----------------------------------------------------------------------------------------------------------------------------------
Gather (cost=47565.00..90289.93 rows=10199 width=140) (actual time=749.517..8782.051 rows=848 loops=1)
Workers Planned: 3
Workers Launched: 3
-> Hash Left Join (cost=46565.00..88270.03 rows=10199 width=140) (actual time=781.632..8641.975 rows=212 loops=4)
Hash Cond: (t1.a = t2.a)
Filter: ((t1.b ~~ '%abcd%'::text) OR (t2.c ~~ '%abcd%'::text))
Rows Removed by Filter: 249786
-> Parallel Seq Scan on t1 (cost=0.00..16378.26 rows=322548 width=70) (actual time=0.069..105.463 rows=249998 loops=4)
Filter: (a > 10)
Rows Removed by Filter: 2
-> Hash (cost=22346.00..22346.00 rows=1000000 width=70) (actual time=754.846..754.846 rows=1000000 loops=4)
Buckets: 65536 Batches: 32 Memory Usage: 3636kB
-> Seq Scan on t2 (cost=0.00..22346.00 rows=1000000 width=70) (actual time=0.039..270.865 rows=1000000 loops=4)
Planning time: 0.262 ms
Execution time: 8790.209 ms
explain analyze
select *
from ( select * from t1 where t1.a > 10 and t1.b like '%abcd%' ) a
left outer join (select * from t2 where t2.a > 10 and t2.c like '%abcd%' ) b on a.a = b.a ;
------------------------------------------------------------------------------------------------------------------------------
Hash Left Join (cost=522.81..12999.45 rows=10100 width=140) (actual time=2.647..3.316 rows=424 loops=1)
Hash Cond: (t1.a = t2.a)
-> Bitmap Heap Scan on t1 (cost=118.28..12557.04 rows=10100 width=70) (actual time=1.048..1.604 rows=424 loops=1)
Recheck Cond: (b ~~ '%abcd%'::text)
Rows Removed by Index Recheck: 47
Filter: (a > 10)
Heap Blocks: exact=465
-> Bitmap Index Scan on idx1 (cost=0.00..115.76 rows=10101 width=0) (actual time=0.974..0.974 rows=471 loops=1)
Index Cond: (b ~~ '%abcd%'::text)
-> Hash (cost=403.28..403.28 rows=100 width=70) (actual time=1.590..1.590 rows=424 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 51kB
-> Bitmap Heap Scan on t2 (cost=28.77..403.28 rows=100 width=70) (actual time=0.988..1.533 rows=424 loops=1)
Recheck Cond: (c ~~ '%abcd%'::text)
Rows Removed by Index Recheck: 47
Filter: (a > 10)
Heap Blocks: exact=465
-> Bitmap Index Scan on idx2 (cost=0.00..28.75 rows=100 width=0) (actual time=0.922..0.922 rows=471 loops=1)
Index Cond: (c ~~ '%abcd%'::text)
Planning time: 0.262 ms
Execution time: 3.397 ms
(20 rows)
UPDATE from Kadek
That's great, union is a good choice.
Based on your query, you can create a new first_name
column on platform_conversations
table, then create an index for two columns first_name
and participant
. Of course, you should balance between write and read flow.
CREATE INDEX idx1 ON t1 USING GIN (b gin_trgm_ops);
CREATE INDEX idx2 ON t1 USING GIN (c gin_trgm_ops);
explain analyze select * from t1 where t1.a > 10 and ( t1.b like '%abcd%' or t1.c like '%abcd%' );
---------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on t1 (cost=149.61..12643.62 rows=10199 width=70) (actual time=2.315..3.423 rows=848 loops=1)
Recheck Cond: ((b ~~ '%abcd%'::text) OR (c ~~ '%abcd%'::text))
Rows Removed by Index Recheck: 94
Filter: (a > 10)
Heap Blocks: exact=470
-> BitmapOr (cost=149.61..149.61 rows=10201 width=0) (actual time=2.244..2.244 rows=0 loops=1)
-> Bitmap Index Scan on idx1 (cost=0.00..115.76 rows=10101 width=0) (actual time=1.203..1.203 rows=471 loops=1)
Index Cond: (b ~~ '%abcd%'::text)
-> Bitmap Index Scan on idx2 (cost=0.00..28.75 rows=100 width=0) (actual time=1.040..1.040 rows=471 loops=1)
Index Cond: (c ~~ '%abcd%'::text)
Planning time: 0.323 ms
Execution time: 3.544 ms