Question

I have the following query, which I want to optimize.

    SELECT 
        a.household_id household_id, 
        age_of_youngest_woman, 
        b.number_of_children,
        c.number_of_men,
        fertility_cond_prob_number_of_children.cond_prob cond_prob_number_of_children,
        fertility_cond_age.cond_prob cond_prob_age,
        fertility_cond_prob_number_of_children.cond_prob * fertility_cond_age.cond_prob total_cond_prob,
        random() <= (874. / 1703.) is_newborn_male
    FROM
        (
            SELECT household_id, MIN(age) age_of_youngest_woman
            FROM person
            WHERE 
                (user_id = 1) and
                (gender = 'FEMALE') and
                (age >= 18)
            GROUP BY household_id
        ) a
        LEFT JOIN
        (
            SELECT household_id, COUNT(*) number_of_children
            FROM person
            WHERE 
                (user_id = 1) and
                (gender = 'CHILD')
            GROUP BY household_id
        ) b ON (a.household_id = b.household_id)
        LEFT JOIN
        (
            SELECT household_id, COUNT(*) number_of_men
            FROM person
            WHERE 
                (user_id = 1) and
                (gender = 'MALE') and
                (age >= 18)
            GROUP BY household_id
        ) c ON (a.household_id = c.household_id)
        LEFT JOIN fertility_cond_prob_number_of_children ON (fertility_cond_prob_number_of_children.number_of_children = b.number_of_children)
        LEFT JOIN fertility_cond_age ON (fertility_cond_age.age = age_of_youngest_woman)
    WHERE 
        (c.number_of_men > 0) and
        (random() <= (fertility_cond_prob_number_of_children.cond_prob * fertility_cond_age.cond_prob))

EXPLAIN ANALYZE returns the following information:

Merge Join  (cost=20366.67..853430.69 rows=34797455 width=44) (actual time=1330.609..1641.402 rows=224 loops=1)    
  Merge Cond: (c.household_id = public.person.household_id)    
  ->  Sort  (cost=4806.12..4829.66 rows=9416 width=16) (actual time=492.839..546.397 rows=25098 loops=1)    
        Sort Key: c.household_id    
        Sort Method: external merge  Disk: 640kB    
        ->  Subquery Scan on c  (cost=3972.76..4184.62 rows=9416 width=16) (actual time=232.953..367.689 rows=25259 loops=1)    
              ->  HashAggregate  (cost=3972.76..4090.46 rows=9416 width=8) (actual time=232.946..288.922 rows=25259 loops=1)    
                    Filter: (count(*) > 0)    
                    ->  Seq Scan on person  (cost=0.00..3737.68 rows=31344 width=8) (actual time=7.366..137.853 rows=38497 loops=1)    
                          Filter: ((age >= 18) AND (user_id = 1) AND ((gender)::text = 'MALE'::text))    
                          Rows Removed by Filter: 64856    
  ->  Materialize  (cost=15560.55..67482.77 rows=739113 width=44) (actual time=836.591..1049.115 rows=352 loops=1)    
        ->  Merge Join  (cost=15560.55..65634.99 rows=739113 width=44) (actual time=836.577..1047.666 rows=352 loops=1)    
              Merge Cond: (public.person.household_id = b.household_id)    
              Join Filter: (random() <= (fertility_cond_prob_number_of_children.cond_prob * fertility_cond_age.cond_prob))    
              Rows Removed by Join Filter: 11054    
              ->  Sort  (cost=4728.64..4747.85 rows=7684 width=20) (actual time=451.992..506.614 rows=26755 loops=1)    
                    Sort Key: public.person.household_id    
                    Sort Method: external merge  Disk: 888kB    
                    ->  Hash Join  (cost=3912.57..4232.73 rows=7684 width=20) (actual time=208.538..357.160 rows=26755 loops=1)    
                          Hash Cond: ((min(public.person.age)) = fertility_cond_age.age)    
                          ->  HashAggregate  (cost=3908.20..4010.65 rows=10245 width=12) (actual time=208.048..263.094 rows=26755 loops=1)    
                                ->  Seq Scan on person  (cost=0.00..3737.68 rows=34104 width=12) (actual time=1.612..111.773 rows=42369 loops=1)    
                                      Filter: ((age >= 18) AND (user_id = 1) AND ((gender)::text = 'FEMALE'::text))    
                                      Rows Removed by Filter: 60984    
                          ->  Hash  (cost=2.50..2.50 rows=150 width=12) (actual time=0.464..0.464 rows=150 loops=1)    
                                Buckets: 1024  Batches: 1  Memory Usage: 6kB    
                                ->  Seq Scan on fertility_cond_age  (cost=0.00..2.50 rows=150 width=12) (actual time=0.019..0.233 rows=150 loops=1)    
              ->  Materialize  (cost=10831.91..11120.48 rows=57715 width=24) (actual time=380.522..455.086 rows=14412 loops=1)    
                    ->  Sort  (cost=10831.91..10976.20 rows=57715 width=24) (actual time=380.504..411.816 rows=14412 loops=1)    
                          Sort Key: b.household_id    
                          Sort Method: external merge  Disk: 480kB    
                          ->  Merge Join  (cost=4205.69..5081.12 rows=57715 width=24) (actual time=221.294..301.093 rows=14412 loops=1)    
                                Merge Cond: (fertility_cond_prob_number_of_children.number_of_children = b.number_of_children)    
                                ->  Sort  (cost=135.34..140.19 rows=1940 width=12) (actual time=0.098..0.107 rows=7 loops=1)    
                                      Sort Key: fertility_cond_prob_number_of_children.number_of_children    
                                      Sort Method: quicksort  Memory: 17kB    
                                      ->  Seq Scan on fertility_cond_prob_number_of_children  (cost=0.00..29.40 rows=1940 width=12) (actual time=0.015..0.051 rows=25 loops=1)    
                                ->  Sort  (cost=4070.35..4085.23 rows=5950 width=16) (actual time=221.176..247.951 rows=14412 loops=1)    
                                      Sort Key: b.number_of_children    
                                      Sort Method: quicksort  Memory: 819kB    
                                      ->  Subquery Scan on b  (cost=3578.32..3697.32 rows=5950 width=16) (actual time=118.096..193.664 rows=14412 loops=1)    
                                            ->  HashAggregate  (cost=3578.32..3637.82 rows=5950 width=8) (actual time=118.090..147.604 rows=14412 loops=1)    
                                                  ->  Seq Scan on person  (cost=0.00..3479.30 rows=19806 width=8) (actual time=30.973..70.129 rows=20025 loops=1)    
                                                        Filter: ((user_id = 1) AND ((gender)::text = 'CHILD'::text))    
                                                        Rows Removed by Filter: 83328    

What can I do in order to improve performance of the query?

I tried to add indices, but this made things worse (the query runs faster without indices).

Update 1:

The query

    SELECT 
        a.household_id household_id, 
        age_of_youngest_woman, 
        a.number_of_children,
        a.number_of_men,
        fertility_cond_prob_number_of_children.cond_prob cond_prob_number_of_children,
        fertility_cond_age.cond_prob cond_prob_age,
        fertility_cond_prob_number_of_children.cond_prob * fertility_cond_age.cond_prob total_cond_prob,
        random() <= (874. / 1703.) is_newborn_male
    FROM
        (SELECT 
            household_id, 
            MIN(CASE WHEN 
                    (gender = 'FEMALE') and 
                    (age >= 18)
                THEN age
                END) age_of_youngest_woman,
            COUNT(CASE WHEN (gender = 'CHILD')
                THEN 1
                END) number_of_children,
            COUNT(CASE WHEN (gender = 'MALE') and 
                            (age >= 18)
                THEN 1
                END) number_of_men
        FROM person
        WHERE user_id = 1 
        GROUP BY household_id) a
        JOIN fertility_cond_prob_number_of_children ON (fertility_cond_prob_number_of_children.number_of_children = a.number_of_children)
        JOIN fertility_cond_age ON (fertility_cond_age.age = a.age_of_youngest_woman)                   
    WHERE 
        (a.number_of_men > 0) and
        (random() <= (fertility_cond_prob_number_of_children.cond_prob * fertility_cond_age.cond_prob))

has the following performance characteristics:

Hash Join  (cost=21783.55..21871.65 rows=6 width=44) (actual time=701.418..3042.547 rows=247 loops=1)
  Hash Cond: ((min(CASE WHEN (((person.gender)::text = 'FEMALE'::text) AND (person.age >= 18)) THEN person.age ELSE NULL::integer END)) = fertility_cond_age.age)
  Join Filter: (random() <= (fertility_cond_prob_number_of_children.cond_prob * fertility_cond_age.cond_prob))
  Rows Removed by Join Filter: 18741
  ->  Nested Loop  (cost=21779.17..21866.82 rows=19 width=36) (actual time=696.983..2949.993 rows=25647 loops=1)
        Join Filter: ((count(CASE WHEN ((person.gender)::text = 'CHILD'::text) THEN 1 ELSE NULL::integer END)) = fertility_cond_prob_number_of_children.number_of_children)
        Rows Removed by Join Filter: 615528
        ->  Seq Scan on fertility_cond_prob_number_of_children  (cost=0.00..29.40 rows=1940 width=12) (actual time=0.007..0.098 rows=25 loops=1)
        ->  Materialize  (cost=21779.17..21779.23 rows=2 width=28) (actual time=27.894..76.814 rows=25647 loops=25)
              ->  HashAggregate  (cost=21779.17..21779.20 rows=2 width=50) (actual time=696.954..764.681 rows=25647 loops=1)
                    Filter: (count(CASE WHEN (((person.gender)::text = 'MALE'::text) AND (person.age >= 18)) THEN 1 ELSE NULL::integer END) > 0)
                    Rows Removed by Filter: 8112
                    ->  Seq Scan on person  (cost=0.00..21648.46 rows=4357 width=50) (actual time=13.910..343.198 rows=106158 loops=1)
                          Filter: (user_id = 1)
  ->  Hash  (cost=2.50..2.50 rows=150 width=12) (actual time=0.480..0.480 rows=150 loops=1)
        Buckets: 1024  Batches: 1  Memory Usage: 6kB
        ->  Seq Scan on fertility_cond_age  (cost=0.00..2.50 rows=150 width=12) (actual time=0.016..0.235 rows=150 loops=1)
Total runtime: 3045.405 ms


Table definitions:

    CREATE TABLE fertility_cond_prob_number_of_children(number_of_children integer, cond_prob double precision);
    CREATE TABLE fertility_cond_age(age integer, cond_prob double precision);
    CREATE TABLE fertility_households(household_id bigint, user_id bigint, age_of_woman integer, number_of_children integer);
    CREATE TABLE person (
        id              SERIAL,
        user_id         bigint NOT NULL,
        age             integer NOT NULL,
        monthly_income  double precision NOT NULL,
        gender          character varying(10),
        household_id    bigint);

Solution

Try something like this:

    SELECT 
        a.household_id, 
        a.age_of_youngest_woman, 
        a.number_of_children,
        a.number_of_men,
        fertility_cond_prob_number_of_children.cond_prob cond_prob_number_of_children,
        fertility_cond_age.cond_prob cond_prob_age,
        fertility_cond_prob_number_of_children.cond_prob * fertility_cond_age.cond_prob total_cond_prob,
        random() <= (874. / 1703.) is_newborn_male
    FROM
        (SELECT household_id, 
                MIN(CASE WHEN (gender = 'FEMALE') 
                              and (age >= 18)
                         THEN age
                    END) age_of_youngest_woman,
                COUNT(CASE WHEN (gender = 'CHILD')
                           THEN 1
                      END) number_of_children,
                COUNT(CASE WHEN (gender = 'MALE')
                                and (age >= 18)
                           THEN 1
                      END) number_of_men
         FROM person
         WHERE user_id = 1 
         GROUP BY household_id) a
    JOIN fertility_cond_prob_number_of_children ON (fertility_cond_prob_number_of_children.number_of_children = a.number_of_children)
    JOIN fertility_cond_age ON (fertility_cond_age.age = a.age_of_youngest_woman)
    WHERE 
        (a.number_of_men > 0) and
        (random() <= (fertility_cond_prob_number_of_children.cond_prob * fertility_cond_age.cond_prob))

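I replaced the three separate scans of person with a single scan that uses CASE expressions inside the aggregates, and replaced the LEFT JOINs with plain JOINs. The join type makes no difference here, because the WHERE clause already discards every row in which the joined columns are NULL. Scanning person once instead of three times should speed up the whole query.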

I haven't tested it, so you may need to fix a few typos before it runs correctly.
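
The single-scan idea can be sanity-checked on a toy dataset. The sketch below is not PostgreSQL: it uses Python's built-in sqlite3 module as a stand-in, a cut-down person table, and made-up rows (all of these are assumptions for illustration). It compares only the aggregation step; the fertility lookup joins and the random() filter are left out.

```python
import sqlite3

# In-memory SQLite database standing in for the real PostgreSQL one.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE person (user_id INTEGER, household_id INTEGER, "
    "gender TEXT, age INTEGER)"
)

# Made-up rows, for illustration only.
rows = [
    # household 1: one woman, one man, two children
    (1, 1, "FEMALE", 34), (1, 1, "MALE", 36),
    (1, 1, "CHILD", 5), (1, 1, "CHILD", 2),
    # household 2: two women, one man, no children
    (1, 2, "FEMALE", 61), (1, 2, "FEMALE", 25), (1, 2, "MALE", 30),
    # household 3: one man and one child, no adult woman
    (1, 3, "MALE", 40), (1, 3, "CHILD", 9),
    # household 4 belongs to a different user and must be ignored
    (2, 4, "FEMALE", 29), (2, 4, "MALE", 31),
]
conn.executemany("INSERT INTO person VALUES (?, ?, ?, ?)", rows)

# Original shape: three filtered aggregates over person, LEFT JOINed together.
THREE_SCANS = """
SELECT a.household_id, a.age_of_youngest_woman,
       b.number_of_children, c.number_of_men
FROM (SELECT household_id, MIN(age) AS age_of_youngest_woman
      FROM person
      WHERE user_id = 1 AND gender = 'FEMALE' AND age >= 18
      GROUP BY household_id) a
LEFT JOIN (SELECT household_id, COUNT(*) AS number_of_children
           FROM person
           WHERE user_id = 1 AND gender = 'CHILD'
           GROUP BY household_id) b ON a.household_id = b.household_id
LEFT JOIN (SELECT household_id, COUNT(*) AS number_of_men
           FROM person
           WHERE user_id = 1 AND gender = 'MALE' AND age >= 18
           GROUP BY household_id) c ON a.household_id = c.household_id
ORDER BY a.household_id
"""

# Rewritten shape: one pass over person, CASE inside each aggregate.
ONE_SCAN = """
SELECT household_id,
       MIN(CASE WHEN gender = 'FEMALE' AND age >= 18 THEN age END)
           AS age_of_youngest_woman,
       COUNT(CASE WHEN gender = 'CHILD' THEN 1 END) AS number_of_children,
       COUNT(CASE WHEN gender = 'MALE' AND age >= 18 THEN 1 END)
           AS number_of_men
FROM person
WHERE user_id = 1
GROUP BY household_id
ORDER BY household_id
"""

three_scans = conn.execute(THREE_SCANS).fetchall()
one_scan = conn.execute(ONE_SCAN).fetchall()

print("three scans:", three_scans)
print("one scan:   ", one_scan)
```

On this data the two forms agree for household 1 and differ in two places. Household 3 has no adult woman, so it appears only in the single-scan result, with a NULL age_of_youngest_woman; the join to fertility_cond_age removes it again, because NULL matches no age. Household 2 has no children, so the three-subquery form gives NULL for number_of_children while COUNT(CASE ...) gives 0. That second difference can change the final result if fertility_cond_prob_number_of_children contains a row for 0 children: the original query drops childless households, whereas the rewrite keeps them. If the original behaviour is the one you want, wrapping the count in NULLIF(..., 0) should restore it, though I have not verified that against the real data.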

Licensed under: CC-BY-SA with attribution