Question

Context:

PostgreSQL 10, with 3667438 records in the users table. The users table has a JSONB column called social. We usually use a strategy of indexing computed function outputs so that we can aggregate information into a single index. The output of the engagement(social) function is a double precision numeric type.

Problem:

The problematic clause is ORDER BY engagement(social) DESC NULLS LAST. There is also a btree index, idx_in_social_engagement, defined with DESC NULLS LAST on this expression.

Fast query:

EXPLAIN ANALYZE
SELECT  "users".* FROM "users"
WHERE (follower_count(social) < 500000)
AND (engagement(social) > 0.03)
AND (engagement(social) < 0.25)
AND (peemv(social) < 533)
ORDER BY "users"."created_at" ASC
LIMIT 12 OFFSET 0;

Limit  (cost=0.43..52.25 rows=12 width=1333) (actual time=0.113..1.625 rows=12 loops=1)
   ->  Index Scan using created_at_idx on users  (cost=0.43..7027711.55 rows=1627352 width=1333) (actual time=0.112..1.623 rows=12 loops=1)
         Filter: ((follower_count(social) < 500000) AND (engagement(social) > '0.03'::double precision) AND (engagement(social) <  '0.25'::double precision) AND (peemv(social) > '0'::double precision) AND (peemv(social) < '533'::double precision))
         Rows Removed by Filter: 8
 Planning time: 0.324 ms
 Execution time: 1.639 ms

Slow query:

EXPLAIN ANALYZE 
SELECT  "users".* FROM "users" 
WHERE (follower_count(social) < 500000) 
AND (engagement(social) > 0.03) 
AND (engagement(social) < 0.25) 
AND (peemv(social) > 0.0) 
AND (peemv(social) < 533) 
ORDER BY engagement(social) DESC NULLS LAST, "users"."created_at" ASC 
LIMIT 12 OFFSET 0;

Limit  (cost=2884438.00..2884438.03 rows=12 width=1341) (actual time=68011.728..68011.730 rows=12 loops=1)
->  Sort  (cost=2884438.00..2888506.38 rows=1627352 width=1341) (actual time=68011.727..68011.728 rows=12 loops=1)
        Sort Key: (engagement(social)) DESC NULLS LAST, created_at
        Sort Method: top-N heapsort  Memory: 45kB
        ->  Index Scan using idx_in_social_engagement on users  (cost=0.43..2847131.26 rows=1627352 width=1341) (actual time=0.082..67019.102 rows=1360633 loops=1)
            Index Cond: ((engagement(social) > '0.03'::double precision) AND (engagement(social) < '0.25'::double precision))
            Filter: ((follower_count(social) < 500000) AND (peemv(social) > '0'::double precision) AND (peemv(social) < '533'::double precision))
            Rows Removed by Filter: 85580
Planning time: 0.312 ms
Execution time: 68011.752 ms

The SELECT uses * because I need all the data stored in each row.

Update:

Exact index definition:

CREATE INDEX idx_in_social_engagement ON influencers USING BTREE (engagement(social) DESC NULLS LAST);


Solution

Your ORDER BY clause is on:

engagement(social) DESC NULLS LAST, "users"."created_at" ASC

But I suspect your index is just on:

engagement(social) DESC NULLS LAST

So the index is not capable of fully supporting the ORDER BY.

You can reproduce the same issue without using either JSONB or expression indexes. You might be able to salvage the situation by creating a composite expression index covering both sort keys in your ORDER BY.
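A minimal sketch of such an index, assuming the functions and users table from the question (the index name is illustrative):

CREATE INDEX idx_in_social_engagement_created_at
    ON users USING BTREE (engagement(social) DESC NULLS LAST, created_at ASC);

With both sort keys in one index, the planner should be able to walk the index in ORDER BY order and stop after the first 12 rows that pass the remaining filters, much like the fast plan does with created_at_idx.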

If the PostgreSQL planner were infinitely wise, it might be able to use the existing index efficiently. It would have to march up engagement(social) DESC NULLS LAST until it collected 12 tuples which met all the rest of the filter requirements. Then it would have to continue marching up that index until it collected all the rest of the tuples which were tied on engagement(social) with the 12th tuple (and which met the other criteria). Then it would have to re-sort all of the collected tuples on the full ORDER BY, and apply the LIMIT 12 to that expanded and re-sorted set. But the PostgreSQL planner is not infinitely wise.
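As a rough illustration only, you can approximate that strategy by hand: over-fetch on the prefix of the ORDER BY that the existing index does support, then re-sort the small result on the full ORDER BY. This sketch is only correct if the over-fetch limit (100 here is an arbitrary guess) covers every row tied on engagement(social) with the 12th row:

SELECT *
FROM (
    SELECT u.*
    FROM users u
    WHERE follower_count(social) < 500000
      AND engagement(social) > 0.03
      AND engagement(social) < 0.25
      AND peemv(social) > 0.0
      AND peemv(social) < 533
    ORDER BY engagement(social) DESC NULLS LAST
    LIMIT 100  -- arbitrary over-fetch; must cover ties on engagement(social)
) t
ORDER BY engagement(social) DESC NULLS LAST, created_at ASC
LIMIT 12;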

OTHER TIPS

I would suspect that the culprit here is the lack of statistics on JSONB columns. Postgres doesn't keep any statistics on JSONB columns; instead it uses hardcoded estimates. If those estimates are very far off, which is quite likely to happen, this can lead to terrible query plans, as in your case.

In your good plan, Postgres first orders the data and then filters it. This is extremely fast for queries with a LIMIT clause and an index on the sort column, provided the filter doesn't remove many rows. Because the index is ordered, getting the rows in order is extremely cheap, and the LIMIT means you only have to fetch a few rows until you have enough to satisfy it.

But if your filter excludes a lot of rows, e.g. if only 0.1% would match it, sorting first would require you to go through most of your table to find enough matching rows, because almost all of them are filtered out. In that case, filtering first via an index and then sorting is far faster. This is what your bad plan does, but with over a million rows passing the filter it is obviously not a good fit for your data.

By far the best option would be to put the values you use here into their own columns. JSONB is extremely useful for some uses, but if you don't need the flexibility it provides the plain old relational way is far better.
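A sketch of that relational approach (column and index names are illustrative, and you would need to keep the column in sync from the application or a trigger):

ALTER TABLE users ADD COLUMN engagement_score double precision;

-- one-off backfill from the existing JSONB-derived function
UPDATE users SET engagement_score = engagement(social);

CREATE INDEX idx_users_engagement_score_created_at
    ON users (engagement_score DESC NULLS LAST, created_at ASC);

A plain column gets per-column statistics from ANALYZE, so the planner can estimate how selective a range like engagement_score > 0.03 AND engagement_score < 0.25 really is.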

A functional index does provide statistics, which should help your query. I would suspect that in your case this doesn't work well because you're passing the entire social column to the function, and Postgres can't build good statistics from that. You could try an index on only the engagement key in the social column; that might provide Postgres with the statistics necessary for a better query plan. See my own question on JSONB statistics for an example of how to do that.
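For example (a sketch, assuming the value is stored under a top-level 'engagement' key in social; adjust the key and the cast to your actual layout):

CREATE INDEX idx_users_social_engagement_key
    ON users (((social ->> 'engagement')::double precision) DESC NULLS LAST);

-- refresh statistics so the indexed expression gets analyzed
ANALYZE users;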

Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange