PostgreSQL: How to optimize performance of a query that uses CTEs and has a jsonb column?
06-10-2020
Question
I'm querying across 3 joined tables. I'm using CTEs to flatten (convert jsonb to tabular form) one table's jsonb column, then querying that derived table so I can compute statistics on the individual data in that jsonb column.
Using this code:
WITH students_query AS (
    SELECT student_number, "examYear", school_name, subjects
    FROM students
    INNER JOIN schools ON students.school_id = schools.school_number
    INNER JOIN results ON results.id = students.subjects_id
    WHERE "examYear" = '2010'
      AND "examType" = 'CSEE'
), subjects_array AS (
    SELECT jsonb_array_elements(subjects) AS subject_list
    FROM students_query
), unwrapper AS (
    SELECT x.*
    FROM subjects_array,
         jsonb_to_record(subject_list) AS x(
             subject varchar(25),
             grade varchar(2)
         )
), failures AS (
    SELECT COUNT(*)::numeric AS fails
    FROM unwrapper
    WHERE "subject" = 'B/MATH' AND "grade" IN ('D', 'F', 'X')
), passes AS (
    SELECT COUNT(*)::numeric AS passes
    FROM unwrapper
    WHERE "subject" = 'B/MATH' AND "grade" IN ('A', 'B', 'C')
), final AS (
    SELECT COUNT(*)::numeric AS allStudents
    FROM unwrapper
    WHERE "subject" = 'B/MATH'
)
SELECT
    fails AS "Number of Fails",
    passes AS "Number of passes",
    allStudents AS "Number of All Students",
    ROUND((fails / allStudents * 100), 2) AS "Percent of Fails",
    ROUND((passes / allStudents * 100), 2) AS "Percent of Passes"
FROM failures, passes, final;
The problem is slowness. This particular query takes about 10 seconds to finish, but it's an important query that most users will be running, so I would love to optimize it.
Steps I've taken
Indexes: I've created some indexes, but I'm not sure they do anything.
drop index if exists fk_idx_results;
drop index if exists idx_results_subjects;
drop index if exists fk_idx_schools;
drop index if exists idx_schools_name;
drop index if exists fk_idx_students;
create index fk_idx_results on results("id");
create index idx_results_subjects on results("subjects", "examYear");
create index fk_idx_schools on schools("school_number");
create index idx_schools_name on schools("school_name");
create index fk_idx_students on students("id", "subjects_id", "school_id");
Settings: I also applied some settings before the query, which got it down to about 9 seconds.
SET cpu_index_tuple_cost = .0005;
SET random_page_cost = 2;
Ask: Now I'm asking for help, as I'm new to PostgreSQL and large-database optimization in general.
Here is the EXPLAIN ANALYZE report for the query.
"Nested Loop (cost=6330354.67..6330354.76 rows=1 width=160) (actual time=10103.862..10103.865 rows=1 loops=1)"
" CTE students_query"
" -> Hash Join (cost=390430.66..607236.38 rows=476181 width=347) (actual time=1378.337..3543.161 rows=458487 loops=1)"
" Hash Cond: (students.school_id = schools.school_number)"
" -> Hash Join (cost=390237.54..600495.77 rows=476181 width=326) (actual time=1376.496..3407.941 rows=458487 loops=1)"
" Hash Cond: (students.subjects_id = results.id)"
" -> Seq Scan on students (cost=0.00..93850.85 rows=4658285 width=36) (actual time=0.029..616.052 rows=4658285 loops=1)"
" -> Hash (cost=362894.28..362894.28 rows=476181 width=340) (actual time=1375.636..1375.636 rows=458487 loops=1)"
" Buckets: 16384 Batches: 64 Memory Usage: 2876kB"
" -> Seq Scan on results (cost=0.00..362894.28 rows=476181 width=340) (actual time=0.020..1187.420 rows=458487 loops=1)"
" Filter: (("examYear" = '2010'::text) AND ("examType" = 'CSEE'::text))"
" Rows Removed by Filter: 4199798"
" -> Hash (cost=113.61..113.61 rows=6361 width=33) (actual time=1.831..1.831 rows=6361 loops=1)"
" Buckets: 8192 Batches: 1 Memory Usage: 469kB"
" -> Seq Scan on schools (cost=0.00..113.61 rows=6361 width=33) (actual time=0.009..0.857 rows=6361 loops=1)"
" CTE subjects_array"
" -> CTE Scan on students_query (cost=0.00..246423.67 rows=47618100 width=32) (actual time=1378.350..4638.517 rows=3473367 loops=1)"
" CTE unwrapper"
" -> Nested Loop (cost=0.00..1904724.00 rows=47618100 width=80) (actual time=1378.369..8489.830 rows=3473367 loops=1)"
" -> CTE Scan on subjects_array (cost=0.00..952362.00 rows=47618100 width=32) (actual time=1378.351..5344.083 rows=3473367 loops=1)"
" -> Function Scan on jsonb_to_record x (cost=0.00..0.01 rows=1 width=80) (actual time=0.001..0.001 rows=1 loops=3473367)"
" CTE failures"
" -> Aggregate (cost=1249984.05..1249984.07 rows=1 width=32) (actual time=9379.408..9379.409 rows=1 loops=1)"
" -> CTE Scan on unwrapper (cost=0.00..1249975.13 rows=3571 width=0) (actual time=1378.423..9342.868 rows=387778 loops=1)"
" Filter: (((subject)::text = 'B/MATH'::text) AND ((grade)::text = ANY ('{D,F,X}'::text[])))"
" Rows Removed by Filter: 3085589"
" CTE passes"
" -> Aggregate (cost=1249984.05..1249984.07 rows=1 width=32) (actual time=365.217..365.217 rows=1 loops=1)"
" -> CTE Scan on unwrapper unwrapper_1 (cost=0.00..1249975.13 rows=3571 width=0) (actual time=0.093..363.892 rows=24002 loops=1)"
" Filter: (((subject)::text = 'B/MATH'::text) AND ((grade)::text = ANY ('{A,B,C}'::text[])))"
" Rows Removed by Filter: 3449365"
" CTE final"
" -> Aggregate (cost=1072002.48..1072002.49 rows=1 width=32) (actual time=359.222..359.222 rows=1 loops=1)"
" -> CTE Scan on unwrapper unwrapper_2 (cost=0.00..1071407.25 rows=238090 width=0) (actual time=0.005..339.228 rows=411822 loops=1)"
" Filter: ((subject)::text = 'B/MATH'::text)"
" Rows Removed by Filter: 3061545"
" -> Nested Loop (cost=0.00..0.05 rows=1 width=64) (actual time=9744.630..9744.632 rows=1 loops=1)"
" -> CTE Scan on failures (cost=0.00..0.02 rows=1 width=32) (actual time=9379.410..9379.411 rows=1 loops=1)"
" -> CTE Scan on passes (cost=0.00..0.02 rows=1 width=32) (actual time=365.219..365.220 rows=1 loops=1)"
" -> CTE Scan on final (cost=0.00..0.02 rows=1 width=32) (actual time=359.224..359.225 rows=1 loops=1)"
"Planning time: 0.546 ms"
"Execution time: 10186.026 ms"
Additional Information
PostgreSQL version: 9.6.1
The results table, which you can see above, has about 4.6 million rows, all of which contain the subjects::jsonb column, which (I guess) makes a big difference there.
The students table also has 4.6 million rows (exactly like the results table); it holds all students whose subject results are in the results table, linked via students.subjects_id.
The schools table has 6361 rows, linked with the students table at schools.school_number = students.school_id.
Sample subjects column output:
[{"grade": "D", "subject": "HIST"}, {"grade": "D", "subject": "GEO"}, {"grade": "D", "subject": "KISW"}, {"grade": "C", "subject": "ENGL"}, {"grade": "D", "subject": "LIT ENG"}]
[{"grade": "D", "subject": "CIV"}, {"grade": "D", "subject": "GEO"}, {"grade": "D", "subject": "KISW"}, {"grade": "D", "subject": "ENGL"}]
[{"grade": "C", "subject": "CIV"}, {"grade": "D", "subject": "KISW"}, {"grade": "B", "subject": "ENGL"}, {"grade": "A", "subject": "CHEM"}, {"grade": "A", "subject": "BIO"}, {"grade": "B", "subject": "ENG SC"},{"grade": "C", "subject": "B/MATH"}, {"grade": "D", "subject": "ELECT INST"}, {"grade": "D", "subject": "ELECT ENG SC"}, {"grade": "F", "subject": "ELECT DRAUGHT"}]
[{"grade": "F", "subject": "CIV"}, {"grade": "F", "subject": "GEO"}, {"grade": "C", "subject": "E/D/KIISLAMU"}, {"grade": "F", "subject": "KISW"}, {"grade": "F", "subject": "ENGL"}, {"grade": "F", "subject": "LIT ENG"}, {"grade": "C", "subject": "ARABIC"}]
[{"grade": "F", "subject": "CIV"}, {"grade": "F", "subject": "HIST"}, {"grade": "F", "subject": "GEO"}, {"grade": "F", "subject": "KISW"}, {"grade": "F", "subject": "ENGL"}, {"grade": "F", "subject": "BIO"}, {"grade": "F", "subject": "B/MATH"}]
Solution
I agree your structure seems a little funny and not normalized. Your indexes aren't doing much, but that is fixable. You probably want to index elements of the JSON such as subject and grade.
Since that isn't a trivial subject to explain, you might want to check out this blog post, which walks through doing exactly that with an example data set:
http://bitnine.net/blog-postgresql/postgresql-internals-jsonb-type-and-its-indexes/?ckattempt=1
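In the meantime, two cheaper wins are worth trying against your current schema. Your plan spends most of its time seq-scanning results and then scanning the unnested unwrapper CTE three separate times; on 9.6, CTEs are optimization fences, so each CTE scan rereads the materialized rows. A single-pass rewrite with aggregate FILTER clauses (available since 9.4) plus an index matching the WHERE clause should help. This is an untested sketch against your posted schema, not a drop-in answer: the index names are made up, and it assumes every student row matches a school (your inner join on schools only supplied school_name, which the final aggregates never use, so I dropped it).

```sql
-- Sketch: one pass over the unnested subjects instead of three CTE scans.
-- jsonb_to_recordset() replaces the jsonb_array_elements + jsonb_to_record pair.
SELECT
    COUNT(*) FILTER (WHERE x.grade IN ('D', 'F', 'X'))::numeric AS "Number of Fails",
    COUNT(*) FILTER (WHERE x.grade IN ('A', 'B', 'C'))::numeric AS "Number of passes",
    COUNT(*)::numeric                                           AS "Number of All Students",
    ROUND(COUNT(*) FILTER (WHERE x.grade IN ('D', 'F', 'X')) * 100.0 / COUNT(*), 2)
        AS "Percent of Fails",
    ROUND(COUNT(*) FILTER (WHERE x.grade IN ('A', 'B', 'C')) * 100.0 / COUNT(*), 2)
        AS "Percent of Passes"
FROM students
JOIN results ON results.id = students.subjects_id
CROSS JOIN LATERAL jsonb_to_recordset(results.subjects)
    AS x(subject varchar(25), grade varchar(2))
WHERE "examYear" = '2010'
  AND "examType" = 'CSEE'
  AND x.subject = 'B/MATH';

-- An index matching the WHERE clause lets the planner avoid the
-- sequential scan of results that dominates your plan (hypothetical name):
CREATE INDEX idx_results_exam ON results ("examYear", "examType");

-- And for containment-style lookups into the jsonb column itself,
-- a GIN index of the kind the linked post describes:
CREATE INDEX idx_results_subjects_gin
    ON results USING gin (subjects jsonb_path_ops);
-- jsonb_path_ops only supports the @> operator, i.e. queries shaped like:
--   SELECT count(*) FROM results
--   WHERE subjects @> '[{"grade": "F", "subject": "B/MATH"}]';
```

Note that the GIN index cannot speed up the unnesting itself; it only pays off if you restructure the counting around containment predicates as in the commented query above. Measure both approaches with EXPLAIN ANALYZE before committing to either.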