Question

I have a table that has a jsonb column which has an array of objects.

Looks like this for every row.

[{"grade": "A", "subject": "MATH"}, {"grade": "B", "subject": "PHY"}, ...]

I'm querying it following this answer: https://stackoverflow.com/a/30592076/2405689 .

The problem is that it takes at least 2.4 seconds to count all students with grade IN ('A', 'B', 'C').

I would like some help with indexing this, because the indexes I have tried haven't helped at all.

DROP INDEX idx_subjects_subject;
DROP INDEX idx_subjects_grade;
CREATE INDEX idx_subjects_subject ON results USING GIN((subjects-> 'subject'));
CREATE INDEX idx_subjects_grade ON results USING GIN((subjects-> 'grade'));

Also did (separately):

DROP INDEX idx_subjects_standard;
CREATE INDEX idx_subjects_standard ON results USING GIN(subjects);

I'm querying them like this.

SELECT COUNT(*)
FROM results
WHERE EXISTS 
    (
        SELECT 1 
        FROM jsonb_array_elements(subjects) AS j(data) 
        WHERE (data #>> '{subject}') LIKE '%MATH%' 
        AND 
        (data #>> '{grade}') IN ('A', 'B', 'C')
    )
AND
    "examYear" = '2010'
AND
    "examType" = 'CSEE'
;

I also tried querying it like this:

SELECT COUNT(*)
FROM results
WHERE EXISTS 
    (
    SELECT 1 
    FROM jsonb_array_elements(subjects) AS j(data) 
    WHERE data @> '{"subject": "B/MATH", "grade": "A"}'
    OR data @> '{"subject": "B/MATH", "grade": "B"}'
    OR data @> '{"subject": "B/MATH", "grade": "C"}'

    )
AND
    "examYear" = '2010'
AND
    "examType" = 'CSEE'
;

But this made things worse (a 3-second query).

Here is my EXPLAIN ANALYZE output:

"Aggregate  (cost=1888617.19..1888617.20 rows=1 width=8) (actual time=2517.116..2517.117 rows=1 loops=1)"
"  ->  Bitmap Heap Scan on results  (cost=470767.18..1888021.96 rows=238090 width=0) (actual time=680.456..2514.633 rows=24002 loops=1)"
"        Recheck Cond: ("examYear" = '2010'::text)"
"        Rows Removed by Index Recheck: 557945"
"        Filter: (("examType" = 'CSEE'::text) AND (SubPlan 1))"
"        Rows Removed by Filter: 500054"
"        Heap Blocks: exact=40156 lossy=53452"
"        ->  Bitmap Index Scan on idx_results_subjects  (cost=0.00..470707.66 rows=528405 width=0) (actual time=672.375..672.375 rows=524056 loops=1)"
"              Index Cond: ("examYear" = '2010'::text)"
"        SubPlan 1"
"          ->  Function Scan on jsonb_array_elements j  (cost=0.00..2.13 rows=1 width=0) (actual time=0.003..0.003 rows=0 loops=458487)"
"                Filter: (((data #>> '{subject}'::text[]) ~~ '%MATH%'::text) AND ((data #>> '{grade}'::text[]) = ANY ('{A,B,C}'::text[])))"
"                Rows Removed by Filter: 8"
"Planning time: 0.126 ms"
"Execution time: 2517.145 ms"

I'm on PostgreSQL 9.6.1.


Solution

The following code will be able to use a GIN index on the jsonb column:

SELECT COUNT(*) FROM results
WHERE subjects @> '[{"subject": "B/MATH", "grade": "A"}]'

The difference from your example is that I don't unpack the array; I query the jsonb column directly. The GIN index can be used for the @> operator on a jsonb column, but it can't be used for arbitrary functions. You do use the @> operator in the end, but not on the actual jsonb column, only on the elements you extracted from it.
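Putting that together, a sketch of the index plus the full query for your case (the index name is my own; adjust to your schema):

```sql
-- GIN index on the whole jsonb column (name is illustrative):
CREATE INDEX idx_results_subjects_gin ON results USING GIN (subjects);

-- Containment is tested directly on the column, so the index is usable:
SELECT COUNT(*)
FROM results
WHERE "examYear" = '2010'
  AND "examType" = 'CSEE'
  AND (subjects @> '[{"subject": "B/MATH", "grade": "A"}]'
    OR subjects @> '[{"subject": "B/MATH", "grade": "B"}]'
    OR subjects @> '[{"subject": "B/MATH", "grade": "C"}]');
```

If you only ever query with @>, the jsonb_path_ops operator class (USING GIN (subjects jsonb_path_ops)) builds a smaller index that is usually faster for containment checks, at the cost of supporting fewer operators.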

Indexing a LIKE condition with a wildcard at the beginning is far, far more difficult. You would need a trigram index for that, but I have no idea how to apply one to data inside an array in a jsonb column, if that is even possible. You should strongly consider storing this information in a more structured way that avoids having to use LIKE at all.

A LIKE pattern without a leading wildcard, such as 'MATH%', can use a btree index under some conditions (see https://www.postgresql.org/docs/9.5/static/indexes-types.html for details; you have to pay attention to the locale for this).
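As a sketch only, on a plain text column (the table and column names here are hypothetical, and this does not apply to the jsonb array directly), a pattern-ops btree index supports left-anchored LIKE even in a non-C locale:

```sql
-- Hypothetical plain-text column; text_pattern_ops makes the
-- index usable for LIKE regardless of locale:
CREATE INDEX idx_subject_prefix ON some_table (subject text_pattern_ops);

-- A left-anchored pattern can use the index; '%MATH%' cannot:
SELECT * FROM some_table WHERE subject LIKE 'MATH%';
```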

If this weren't an array but a plain object inside the jsonb column, you could use a functional index, e.g.:

CREATE INDEX ON results((subjects->>'grade'));

This may also be possible with an array inside jsonb, but I can't think of a reasonable way to do it right now.

Your current schema makes everything an order of magnitude harder to write and makes it very hard, or impossible, to take proper advantage of indexes. If you have the option, consider storing this data in a separate table with a subject and a grade column; that would make this problem far easier.
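As a sketch of that normalization (assuming results has a primary key id; all names here are illustrative):

```sql
-- One row per (student result, subject) instead of a jsonb array:
CREATE TABLE result_subjects (
    result_id bigint NOT NULL REFERENCES results (id),
    subject   text   NOT NULL,
    grade     text   NOT NULL
);

-- Supports lookups by subject and grade:
CREATE INDEX ON result_subjects (subject, grade);

-- The original count becomes a plain, indexable semi-join:
SELECT COUNT(*)
FROM results r
WHERE r."examYear" = '2010'
  AND r."examType" = 'CSEE'
  AND EXISTS (
      SELECT 1
      FROM result_subjects s
      WHERE s.result_id = r.id
        AND s.subject = 'B/MATH'
        AND s.grade IN ('A', 'B', 'C')
  );
```

With this layout the LIKE problem disappears as well: subject is a plain text column, so equality or a trigram/pattern-ops index works directly on it.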

Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange