PostgreSQL: How to optimize performance of a query that uses CTEs and has a jsonb column?
06-10-2020
Question
I'm querying across 3 joined tables. I'm using CTEs to flatten (convert jsonb to tabular form) one table's jsonb column, then querying that derived table so I can compute statistics on the individual data in that jsonb column.
Using this code:
WITH students_query AS (
    SELECT student_number, "examYear", school_name, subjects
    FROM students
    INNER JOIN schools ON students.school_id = schools.school_number
    INNER JOIN results ON results.id = students.subjects_id
    WHERE "examYear" = '2010'
      AND "examType" = 'CSEE'
), subjects_array AS (
    SELECT jsonb_array_elements(subjects) AS subject_list
    FROM students_query
), unwrapper AS (
    SELECT x.*
    FROM subjects_array,
         jsonb_to_record(subject_list) AS x(
             subject varchar(25),
             grade varchar(2)
         )
), failures AS (
    SELECT COUNT(*)::numeric AS fails
    FROM unwrapper
    WHERE "subject" = 'B/MATH' AND "grade" IN ('D', 'F', 'X')
), passes AS (
    SELECT COUNT(*)::numeric AS passes
    FROM unwrapper
    WHERE "subject" = 'B/MATH' AND "grade" IN ('A', 'B', 'C')
), final AS (
    SELECT COUNT(*)::numeric AS allStudents
    FROM unwrapper
    WHERE "subject" = 'B/MATH'
)
SELECT
    fails AS "Number of Fails",
    passes AS "Number of passes",
    allStudents AS "Number of All Students",
    ROUND((fails / allStudents * 100), 2) AS "Percent of Fails",
    ROUND((passes / allStudents * 100), 2) AS "Percent of Passes"
FROM failures, passes, final;
The problem is slowness. This particular query takes about 10 seconds to finish, but it's an important query that most users will be running, so I would love to optimize it.
Steps I've taken
Indexes: I've created some indexes, but I'm not sure they do anything.
drop index if exists fk_idx_results;
drop index if exists idx_results_subjects;
drop index if exists fk_idx_schools;
drop index if exists idx_schools_name;
drop index if exists fk_idx_students;
create index fk_idx_results on results("id");
create index idx_results_subjects on results("subjects", "examYear");
create index fk_idx_schools on schools("school_number");
create index idx_schools_name on schools("school_name");
create index fk_idx_students on students("id", "subjects_id", "school_id");
Settings: I also applied some settings before the query, which got it down to about 9 seconds.
SET cpu_index_tuple_cost = .0005;
SET random_page_cost = 2;
Ask: Now I'm asking for help, as I'm new to PostgreSQL and large-database optimization in general.
Here is the EXPLAIN ANALYZE report for the query.
"Nested Loop (cost=6330354.67..6330354.76 rows=1 width=160) (actual time=10103.862..10103.865 rows=1 loops=1)"
" CTE students_query"
" -> Hash Join (cost=390430.66..607236.38 rows=476181 width=347) (actual time=1378.337..3543.161 rows=458487 loops=1)"
" Hash Cond: (students.school_id = schools.school_number)"
" -> Hash Join (cost=390237.54..600495.77 rows=476181 width=326) (actual time=1376.496..3407.941 rows=458487 loops=1)"
" Hash Cond: (students.subjects_id = results.id)"
" -> Seq Scan on students (cost=0.00..93850.85 rows=4658285 width=36) (actual time=0.029..616.052 rows=4658285 loops=1)"
" -> Hash (cost=362894.28..362894.28 rows=476181 width=340) (actual time=1375.636..1375.636 rows=458487 loops=1)"
" Buckets: 16384 Batches: 64 Memory Usage: 2876kB"
" -> Seq Scan on results (cost=0.00..362894.28 rows=476181 width=340) (actual time=0.020..1187.420 rows=458487 loops=1)"
" Filter: (("examYear" = '2010'::text) AND ("examType" = 'CSEE'::text))"
" Rows Removed by Filter: 4199798"
" -> Hash (cost=113.61..113.61 rows=6361 width=33) (actual time=1.831..1.831 rows=6361 loops=1)"
" Buckets: 8192 Batches: 1 Memory Usage: 469kB"
" -> Seq Scan on schools (cost=0.00..113.61 rows=6361 width=33) (actual time=0.009..0.857 rows=6361 loops=1)"
" CTE subjects_array"
" -> CTE Scan on students_query (cost=0.00..246423.67 rows=47618100 width=32) (actual time=1378.350..4638.517 rows=3473367 loops=1)"
" CTE unwrapper"
" -> Nested Loop (cost=0.00..1904724.00 rows=47618100 width=80) (actual time=1378.369..8489.830 rows=3473367 loops=1)"
" -> CTE Scan on subjects_array (cost=0.00..952362.00 rows=47618100 width=32) (actual time=1378.351..5344.083 rows=3473367 loops=1)"
" -> Function Scan on jsonb_to_record x (cost=0.00..0.01 rows=1 width=80) (actual time=0.001..0.001 rows=1 loops=3473367)"
" CTE failures"
" -> Aggregate (cost=1249984.05..1249984.07 rows=1 width=32) (actual time=9379.408..9379.409 rows=1 loops=1)"
" -> CTE Scan on unwrapper (cost=0.00..1249975.13 rows=3571 width=0) (actual time=1378.423..9342.868 rows=387778 loops=1)"
" Filter: (((subject)::text = 'B/MATH'::text) AND ((grade)::text = ANY ('{D,F,X}'::text[])))"
" Rows Removed by Filter: 3085589"
" CTE passes"
" -> Aggregate (cost=1249984.05..1249984.07 rows=1 width=32) (actual time=365.217..365.217 rows=1 loops=1)"
" -> CTE Scan on unwrapper unwrapper_1 (cost=0.00..1249975.13 rows=3571 width=0) (actual time=0.093..363.892 rows=24002 loops=1)"
" Filter: (((subject)::text = 'B/MATH'::text) AND ((grade)::text = ANY ('{A,B,C}'::text[])))"
" Rows Removed by Filter: 3449365"
" CTE final"
" -> Aggregate (cost=1072002.48..1072002.49 rows=1 width=32) (actual time=359.222..359.222 rows=1 loops=1)"
" -> CTE Scan on unwrapper unwrapper_2 (cost=0.00..1071407.25 rows=238090 width=0) (actual time=0.005..339.228 rows=411822 loops=1)"
" Filter: ((subject)::text = 'B/MATH'::text)"
" Rows Removed by Filter: 3061545"
" -> Nested Loop (cost=0.00..0.05 rows=1 width=64) (actual time=9744.630..9744.632 rows=1 loops=1)"
" -> CTE Scan on failures (cost=0.00..0.02 rows=1 width=32) (actual time=9379.410..9379.411 rows=1 loops=1)"
" -> CTE Scan on passes (cost=0.00..0.02 rows=1 width=32) (actual time=365.219..365.220 rows=1 loops=1)"
" -> CTE Scan on final (cost=0.00..0.02 rows=1 width=32) (actual time=359.224..359.225 rows=1 loops=1)"
"Planning time: 0.546 ms"
"Execution time: 10186.026 ms"
Additional Information
PostgreSQL version: 9.6.1
The results table, which you can see above, has about 4.6 million rows, all of which contain the subjects::jsonb column, which (I guess) makes a big difference there.
The students table also has 4.6 million rows (exactly like the results table); it holds all students whose subject results are in the results table, linked via students.subjects_id.
The schools table has 6361 rows, linked with the students table at schools.school_number = students.school_id.
Sample subjects column output:
[{"grade": "D", "subject": "HIST"}, {"grade": "D", "subject": "GEO"}, {"grade": "D", "subject": "KISW"}, {"grade": "C", "subject": "ENGL"}, {"grade": "D", "subject": "LIT ENG"}]
[{"grade": "D", "subject": "CIV"}, {"grade": "D", "subject": "GEO"}, {"grade": "D", "subject": "KISW"}, {"grade": "D", "subject": "ENGL"}]
[{"grade": "C", "subject": "CIV"}, {"grade": "D", "subject": "KISW"}, {"grade": "B", "subject": "ENGL"}, {"grade": "A", "subject": "CHEM"}, {"grade": "A", "subject": "BIO"}, {"grade": "B", "subject": "ENG SC"},{"grade": "C", "subject": "B/MATH"}, {"grade": "D", "subject": "ELECT INST"}, {"grade": "D", "subject": "ELECT ENG SC"}, {"grade": "F", "subject": "ELECT DRAUGHT"}]
[{"grade": "F", "subject": "CIV"}, {"grade": "F", "subject": "GEO"}, {"grade": "C", "subject": "E/D/KIISLAMU"}, {"grade": "F", "subject": "KISW"}, {"grade": "F", "subject": "ENGL"}, {"grade": "F", "subject": "LIT ENG"}, {"grade": "C", "subject": "ARABIC"}]
[{"grade": "F", "subject": "CIV"}, {"grade": "F", "subject": "HIST"}, {"grade": "F", "subject": "GEO"}, {"grade": "F", "subject": "KISW"}, {"grade": "F", "subject": "ENGL"}, {"grade": "F", "subject": "BIO"}, {"grade": "F", "subject": "B/MATH"}]
Solution
I agree your structure seems a little funny and not normalized. Your indexes aren't doing much, but that is fixable. You probably want to index elements of the JSON such as subject and grade.
Since that isn't a trivial subject to explain, you might want to check out this blog post, which walks through doing exactly that with an example data set:
http://bitnine.net/blog-postgresql/postgresql-internals-jsonb-type-and-its-indexes/?ckattempt=1
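In the meantime, two cheaper wins are worth trying against your current schema. Your plan spends most of its time seq-scanning results and then scanning the unnested unwrapper CTE three separate times; on 9.6, CTEs are optimization fences, so each CTE scan rereads the materialized rows. A single-pass rewrite with aggregate FILTER clauses (available since 9.4) plus an index matching the WHERE clause should help. This is an untested sketch against your posted schema, not a drop-in answer: the index names are made up, and it assumes every student row matches a school (your inner join on schools only supplied school_name, which the final aggregates never use, so I dropped it).

```sql
-- Sketch: one pass over the unnested subjects instead of three CTE scans.
-- jsonb_to_recordset() replaces the jsonb_array_elements + jsonb_to_record pair.
SELECT
    COUNT(*) FILTER (WHERE x.grade IN ('D', 'F', 'X'))::numeric AS "Number of Fails",
    COUNT(*) FILTER (WHERE x.grade IN ('A', 'B', 'C'))::numeric AS "Number of passes",
    COUNT(*)::numeric                                           AS "Number of All Students",
    ROUND(COUNT(*) FILTER (WHERE x.grade IN ('D', 'F', 'X')) * 100.0 / COUNT(*), 2)
        AS "Percent of Fails",
    ROUND(COUNT(*) FILTER (WHERE x.grade IN ('A', 'B', 'C')) * 100.0 / COUNT(*), 2)
        AS "Percent of Passes"
FROM students
JOIN results ON results.id = students.subjects_id
CROSS JOIN LATERAL jsonb_to_recordset(results.subjects)
    AS x(subject varchar(25), grade varchar(2))
WHERE "examYear" = '2010'
  AND "examType" = 'CSEE'
  AND x.subject = 'B/MATH';

-- An index matching the WHERE clause lets the planner avoid the
-- sequential scan of results that dominates your plan (hypothetical name):
CREATE INDEX idx_results_exam ON results ("examYear", "examType");

-- And for containment-style lookups into the jsonb column itself,
-- a GIN index of the kind the linked post describes:
CREATE INDEX idx_results_subjects_gin
    ON results USING gin (subjects jsonb_path_ops);
-- jsonb_path_ops only supports the @> operator, i.e. queries shaped like:
--   SELECT count(*) FROM results
--   WHERE subjects @> '[{"grade": "F", "subject": "B/MATH"}]';
```

Note that the GIN index cannot speed up the unnesting itself; it only pays off if you restructure the counting around containment predicates as in the commented query above. Measure both approaches with EXPLAIN ANALYZE before committing to either.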