Question

I have a very weird dataset, where only a few records have any data at all in a large table, but when they do, it's hundreds of thousands of records. I'm trying to select only the records that have data, but I'm having some problems with index usage. I know you usually cannot "force" PostgreSQL to use a certain index, but in this case forcing it works.

SELECT matches.id, count(frames.id) FROM matches LEFT JOIN frames ON frames.match_id = matches.id GROUP BY matches.id HAVING count(frames.id) > 0 ORDER BY count(frames.id) DESC;
 id | count  
----+--------
 31 | 123363
 28 | 121475
 24 | 110155
 21 | 108258
 22 | 106837
 25 |  89182
 26 |  87104
 27 |  86152
(8 rows)

SELECT matches.id, count(frames.id) FROM matches LEFT JOIN frames ON frames.match_id = matches.id GROUP BY matches.id HAVING count(frames.id) = 0 ORDER BY count(frames.id) DESC;
....
(568 rows)

Two solutions I've found would be:

SELECT "matches".* FROM "matches" WHERE EXISTS (SELECT true FROM frames WHERE frames.match_id = matches.id LIMIT 1);
Time: 11697,645 ms


or

SELECT DISTINCT "matches".* FROM "matches" INNER JOIN "frames" ON "frames"."match_id" = "matches"."id"
Time: 879,325 ms

Neither query seems to use the index on match_id in the frames table. That's understandable, since the column is normally not very selective, but here the index would be really helpful, as this shows:

SET enable_seqscan = OFF;
SELECT "matches".* FROM "matches" WHERE (SELECT true FROM frames WHERE frames.match_id = matches.id LIMIT 1);
Time: 1,239 ms
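
If the setting is only meant as a diagnostic for this one query, it can be scoped so it does not affect the rest of the session. A minimal sketch, assuming the same tables and query as above: SET LOCAL is standard PostgreSQL and reverts automatically when the transaction ends.

```sql
BEGIN;
-- SET LOCAL only lasts until the end of the current transaction,
-- so sequential scans are discouraged for this query alone.
SET LOCAL enable_seqscan = off;
SELECT "matches".* FROM "matches" WHERE (SELECT true FROM frames WHERE frames.match_id = matches.id LIMIT 1);
COMMIT;
```

This keeps the planner toggle out of other queries, but it is still a workaround and not a fix for the underlying cost estimate.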

EXPLAIN for queries:

EXPLAIN for: SELECT DISTINCT "matches".* FROM "matches" INNER JOIN "frames" ON "frames"."match_id" = "matches"."id"

                                 QUERY PLAN
-----------------------------------------------------------------------------
 HashAggregate  (cost=59253.47..59256.38 rows=290 width=155)
   ->  Hash Join  (cost=6.26..33716.73 rows=785746 width=155)
         Hash Cond: (frames.match_id = matches.id)
         ->  Seq Scan on frames  (cost=0.00..22906.46 rows=785746 width=4)
         ->  Hash  (cost=4.45..4.45 rows=145 width=155)
               ->  Seq Scan on matches  (cost=0.00..4.45 rows=145 width=155)
(6 rows)

EXPLAIN for: SELECT "matches".* FROM "matches" WHERE (EXISTS (SELECT id FROM frames WHERE frames.match_id = matches.id LIMIT 1))

                                 QUERY PLAN
-----------------------------------------------------------------------------
 Seq Scan on matches  (cost=0.00..41.17 rows=72 width=155)
   Filter: (SubPlan 1)
   SubPlan 1
     ->  Limit  (cost=0.00..0.25 rows=1 width=4)
           ->  Seq Scan on frames  (cost=0.00..24870.83 rows=98218 width=4)
                 Filter: (match_id = matches.id)
(6 rows)

SET enable_seqscan = OFF;

EXPLAIN SELECT "matches".* FROM "matches" WHERE (SELECT true FROM frames WHERE frames.match_id = matches.id LIMIT 1);

                                 QUERY PLAN
-----------------------------------------------------------------------------
 Seq Scan on matches  (cost=10000000000.00..10000000118.37 rows=72 width=155)
   Filter: (SubPlan 1)
   SubPlan 1
     ->  Limit  (cost=0.00..0.79 rows=1 width=0)
           ->  Index Scan using index_frames_on_match_id on frames  (cost=0.00..81762.68 rows=104066 width=0)
                 Index Cond: (match_id = matches.id)
(6 rows)

Any suggestions on how to tweak the query so that it uses the index here? Or are there other ways to check for the existence of records that would run closer to the 1 ms I get with the index than to 11 s?

PS. I did run ANALYZE and VACUUM ANALYZE, all the steps normally suggested to improve index usage.

EDIT: Thanks to David Aldridge for pointing out that LIMIT 1 might actually be hindering the query planner. I've now gotten to:

SELECT "matches".* FROM "matches" WHERE EXISTS (SELECT true FROM frames WHERE frames.match_id = matches.id);
Time: 163,803 ms

With the plan:

EXPLAIN SELECT "matches".* FROM "matches" WHERE EXISTS (SELECT true FROM frames WHERE frames.match_id = matches.id);
                                     QUERY PLAN                                     
------------------------------------------------------------------------------------
 Nested Loop  (cost=25455.58..25457.90 rows=8 width=155)
   ->  HashAggregate  (cost=25455.58..25455.66 rows=8 width=4)
         ->  Seq Scan on frames  (cost=0.00..23374.26 rows=832526 width=4)
   ->  Index Scan using matches_pkey on matches  (cost=0.00..0.27 rows=1 width=155)
         Index Cond: (id = frames.match_id)
(5 rows)

This is still about 100 times slower than the index-based version (probably because of the Seq Scan + HashAggregate on frames that is still performed).


Solution

In the EXISTS-based alternative, the LIMIT clause is redundant, because EXISTS stops at the first matching row anyway, and it might be getting in the optimiser's way.

Try:

SELECT "matches".*
FROM   "matches"
WHERE  EXISTS (SELECT 1
                 FROM frames
                WHERE frames.match_id = matches.id);
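
To check whether this form actually changes the plan and the runtime on your data, one option is to run it under EXPLAIN ANALYZE, which executes the query and reports actual row counts and timings next to the planner's estimates. A minimal sketch, reusing the query above unchanged:

```sql
-- EXPLAIN ANALYZE really runs the statement, so the reported
-- times reflect actual execution and not just cost estimates.
EXPLAIN ANALYZE
SELECT "matches".*
FROM   "matches"
WHERE  EXISTS (SELECT 1
                 FROM frames
                WHERE frames.match_id = matches.id);
```

Comparing the estimated and actual row counts for the scan on frames should show where the planner's assumptions diverge from the real data distribution.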
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow