Question

I have a query that needs to update a table with ~14 million records. It's getting the value it needs to update from another table via a join table. Like this...

UPDATE listings
SET master_ext_id = c.id
FROM listings a
    JOIN listing_to_external_id b on a.id = b.listing_id
    JOIN external_ids c on b.external_id = c.id
AND a.master_ext_id is null
AND c.provider_id = 0

Update on listings  (cost=731559.58..133068628689.23 rows=10645372213165 width=637)
  ->  Nested Loop  (cost=731559.58..133068628689.23 rows=10645372213165 width=637)
        ->  Seq Scan on listings  (cost=0.00..397447.29 rows=14832429 width=611)
        ->  Materialize  (cost=731559.58..1135721.70 rows=717709 width=26)
              ->  Hash Join  (cost=731559.58..1132133.16 rows=717709 width=26)
                    Hash Cond: (b.listing_id = a.id)
                    ->  Hash Join  (cost=148706.93..526852.10 rows=717709 width=28)
                          Hash Cond: (b.external_id = c.id)
                          ->  Seq Scan on listing_to_external_id b  (cost=0.00..236589.51 rows=15357551 width=22)
                          ->  Hash  (cost=139735.49..139735.49 rows=717715 width=14)
                                ->  Index Scan using ei_provider_id on external_ids c  (cost=0.00..139735.49 rows=717715 width=14)
                                      Index Cond: (provider_id = 0)
                    ->  Hash  (cost=397447.29..397447.29 rows=14832429 width=14)
                          ->  Seq Scan on listings a  (cost=0.00..397447.29 rows=14832429 width=14)
                                Filter: (master_ext_id IS NULL)

Obviously, looking at the execution plan, you can see that this query is taking an extremely long time. I'm assuming at this point that it has to do with the number of rows that are involved in the query, but I need a way to speed this up somehow.

In addition to the ~14 million records in the listings table, there are ~15 million rows in the listing_to_external_id table and ~15 million in the external_ids table.

I've tried setting enable_seqscan to off, and it uses the indexes I've created, so I know it's just a case of the planner determining that a seq scan would be faster. I've also ANALYZE'd my tables.

I've tried limiting the rows updated by using the primary key on the listings table, hoping I might be able to loop through and update the rows a handful at a time. As you can see, this had little effect...

UPDATE listings
SET master_ext_id = c.id
FROM listings a
    JOIN listing_to_external_id b on a.id = b.listing_id
    JOIN external_ids c on b.external_id = c.id
WHERE a.id >= 34649050
AND a.id <= 35649050
AND a.master_ext_id is null
AND c.provider_id = 0

Update on listings  (cost=212130.40..9379727588.60 rows=750294018398 width=637)
  ->  Nested Loop  (cost=212130.40..9379727588.60 rows=750294018398 width=637)
        ->  Seq Scan on listings  (cost=0.00..397447.29 rows=14832429 width=611)
        ->  Materialize  (cost=212130.40..600005.71 rows=50585 width=26)
              ->  Hash Join  (cost=212130.40..599752.78 rows=50585 width=26)
                    Hash Cond: (b.listing_id = a.id)
                    ->  Hash Join  (cost=148706.93..526852.10 rows=717709 width=28)
                          Hash Cond: (b.external_id = c.id)
                          ->  Seq Scan on listing_to_external_id b  (cost=0.00..236589.51 rows=15357551 width=22)
                          ->  Hash  (cost=139735.49..139735.49 rows=717715 width=14)
                                ->  Index Scan using ei_provider_id on external_ids c  (cost=0.00..139735.49 rows=717715 width=14)
                                      Index Cond: (provider_id = 0)
                    ->  Hash  (cost=50355.96..50355.96 rows=1045401 width=14)
                          ->  Index Scan using listings_pkey on listings a  (cost=0.00..50355.96 rows=1045401 width=14)
                                Index Cond: ((id >= 34649050) AND (id <= 35649050))
                                Filter: (master_ext_id IS NULL)

I've tried tuning the settings on Postgres to better handle such a large query, but this seemed to have little effect as well. I can get into these settings if nothing can be done with the query itself.

I've also tried taking the result of the join between listing_to_external_id and external_ids and putting it into a table, indexing it, and then joining listings on that table. This resulted in a very similar execution plan/cost.

I'm not sure what else to do at this point. I let the query run over the weekend and it's still running. Any suggestions?

Solution

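You used the listings table twice: once as the UPDATE target and once in FROM. Look at the first execution plan. The top-level Nested Loop pairs a Seq Scan on listings (the update target, 14832429 rows) with the materialized join result (717709 rows) without any join condition. That is a Cartesian product (CROSS JOIN) of listings with the join, which is where the estimate of about 10.6 trillion rows comes from. You need listings only in the UPDATE clause.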
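
The effect is easy to reproduce on a toy schema. The sketch below is only an illustration: the temp tables t and s are hypothetical stand-ins, not the tables from the question. PostgreSQL treats a target table repeated in FROM as a separate, independent reference, so the first statement has nothing linking the rows being updated to the joined rows.

```sql
-- Hypothetical toy tables, used only to compare the two plans.
CREATE TEMP TABLE t (id int PRIMARY KEY, v int);
CREATE TEMP TABLE s (id int PRIMARY KEY, v int);
INSERT INTO t VALUES (1, NULL), (2, NULL);
INSERT INTO s VALUES (1, 10), (2, 20);

-- Target t is repeated in FROM as "a". No condition ties the
-- target rows to "a", so every target row is paired with every
-- row produced by the a/s join.
EXPLAIN
UPDATE t
SET v = s.v
FROM t a
    JOIN s ON a.id = s.id;

-- Target t is aliased once and joined to s in WHERE, so each
-- target row is matched only with its own s row.
EXPLAIN
UPDATE t a
SET v = s.v
FROM s
WHERE a.id = s.id;
```

Comparing the two plans should show the first one joining t to the a/s result with no join condition, while the second joins directly on a.id = s.id.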

Try something like this:

UPDATE listings a
SET master_ext_id = c.id
FROM listing_to_external_id b
JOIN external_ids c on b.external_id = c.id
WHERE a.id = b.listing_id
 AND a.master_ext_id is null
 AND c.provider_id = 0
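
Before running this against all ~14 million rows, it may be worth checking the plan with EXPLAIN first and, if one transaction is still too heavy, reusing the id-range batching idea from the question. The following is a sketch under those assumptions; the id bounds are simply the example range from the question, not a recommended batch size.

```sql
-- Check the plan only; plain EXPLAIN does not execute the update.
EXPLAIN
UPDATE listings a
SET master_ext_id = c.id
FROM listing_to_external_id b
    JOIN external_ids c ON b.external_id = c.id
WHERE a.id = b.listing_id
  AND a.master_ext_id IS NULL
  AND c.provider_id = 0;

-- One batch, limited by primary key range (example bounds).
-- Repeat with the next id range until all rows are covered.
UPDATE listings a
SET master_ext_id = c.id
FROM listing_to_external_id b
    JOIN external_ids c ON b.external_id = c.id
WHERE a.id = b.listing_id
  AND a.id >= 34649050
  AND a.id <= 35649050
  AND a.master_ext_id IS NULL
  AND c.provider_id = 0;
```

The plan should no longer contain a second scan of listings under an unconditioned Nested Loop. Note that if a listing has more than one matching external_ids row with provider_id = 0, PostgreSQL uses only one of them for the update, and which one is not predictable.
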
Licensed under: CC-BY-SA with attribution