Question

I've got a table pings with about 15 million rows in it. I'm on postgres 9.2.4. The relevant columns it has are a foreign key monitor_id, a created_at timestamp, and a response_time that's an integer that represents milliseconds. Here is the exact structure:

     Column      |            Type             |                     Modifiers                      
-----------------+-----------------------------+----------------------------------------------------
 id              | integer                     | not null default nextval('pings_id_seq'::regclass)
 url             | character varying(255)      | 
 monitor_id      | integer                     | 
 response_status | integer                     | 
 response_time   | integer                     | 
 created_at      | timestamp without time zone | 
 updated_at      | timestamp without time zone | 
 response_body   | text                        | 
Indexes:
    "pings_pkey" PRIMARY KEY, btree (id)
    "index_pings_on_created_at_and_monitor_id" btree (created_at DESC, monitor_id)
    "index_pings_on_monitor_id" btree (monitor_id)

I want to query for all the response times that are not NULL (90% won't be NULL, about 10% will be NULL), that have a specific monitor_id, and that were created in the last month. I'm doing the query with ActiveRecord, but the end result looks something like this:

SELECT "pings"."response_time"
FROM "pings"
WHERE "pings"."monitor_id" = 3
AND (created_at > '2014-03-03 20:23:07.254281'
AND response_time IS NOT NULL)

It's a pretty basic query, but it takes about 2000 ms to run, which seems rather slow. I'm assuming an index would make it faster, but none of the indexes I've tried are helping, which I assume means I'm not indexing properly.

When I run EXPLAIN ANALYZE, this is what I get:

Bitmap Heap Scan on pings  (cost=6643.25..183652.31 rows=83343 width=4) (actual time=58.997..1736.179 rows=42063 loops=1)
  Recheck Cond: (monitor_id = 3)
  Rows Removed by Index Recheck: 11643313
  Filter: ((response_time IS NOT NULL) AND (created_at > '2014-03-03 20:23:07.254281'::timestamp without time zone))
  Rows Removed by Filter: 324834
  ->  Bitmap Index Scan on index_pings_on_monitor_id  (cost=0.00..6622.41 rows=358471 width=0) (actual time=57.935..57.935 rows=366897 loops=1)
        Index Cond: (monitor_id = 3)

So there is an index on monitor_id that is being used towards the end, but nothing else. I've tried various permutations and orders of compound indexes using monitor_id, created_at, and response_time. I've tried ordering the index by created_at in descending order. I've tried a partial index with response_time IS NOT NULL.

Nothing I've tried makes the query any faster. How would you optimize and/or index it?

Solution

Sequence of columns

Create a partial multicolumn index with the right sequence of columns. You have one:

"index_pings_on_created_at_and_monitor_id" btree (created_at DESC, monitor_id)

But the sequence of columns is not serving you well. Reverse it:

CREATE INDEX idx_pings_monitor_created ON pings (monitor_id, created_at DESC)
WHERE response_time IS NOT NULL;

The rule of thumb here is: equality first, ranges later. More about that:
Multicolumn index and performance

As discussed, the condition WHERE response_time IS NOT NULL does not buy you much. If you have other queries that could utilize this index including NULL values in response_time, drop the condition. Else, keep it.
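If you do decide to drop the partial condition, the plain multicolumn variant is the same index minus the WHERE clause (a sketch, using the index name suggested above):

```sql
-- Same column order: equality column first, range column second.
-- Without the predicate, queries that include NULL response_time
-- can use this index, too.
CREATE INDEX idx_pings_monitor_created
ON pings (monitor_id, created_at DESC);
```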

You can probably also drop both other existing indexes. More about the sequence of columns in btree indexes:
Working of indexes in PostgreSQL
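Dropping the two now-redundant indexes is straightforward (index names taken from your \d output; verify on a copy of the data first that no other query depends on them):

```sql
-- Redundant once the new (monitor_id, created_at DESC) index exists:
DROP INDEX index_pings_on_created_at_and_monitor_id;
DROP INDEX index_pings_on_monitor_id;
-- Keep pings_pkey; the primary key index is still needed.
```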

Covering index

If all you need from the table is response_time, this can be much faster still, provided you don't have lots of write operations on the rows of your table. Include the column in the index at the last position to allow index-only scans (making it a "covering index"):

CREATE INDEX idx_pings_monitor_created
ON     pings (monitor_id, created_at DESC, response_time)
WHERE  response_time IS NOT NULL;  -- maybe
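To confirm the planner actually produces an index-only scan, re-run the query under EXPLAIN (query taken verbatim from the question):

```sql
-- Index-only scans require Postgres 9.2+ (which you have) and an
-- up-to-date visibility map, so run VACUUM on the table first.
VACUUM pings;

EXPLAIN (ANALYZE, BUFFERS)
SELECT response_time
FROM   pings
WHERE  monitor_id = 3
AND    response_time IS NOT NULL
AND    created_at > '2014-03-03 20:23:07.254281';
```

You want to see "Index Only Scan" in the plan, ideally with a low "Heap Fetches" count; many heap fetches mean the visibility map is stale and the scan is falling back to the heap for most rows.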

Or you could go one step further:

More radical partial index

Create a tiny helper function. Effectively a "global constant" in your db:

CREATE OR REPLACE FUNCTION f_ping_event_horizon()
  RETURNS timestamp LANGUAGE sql IMMUTABLE COST 1 AS
$$SELECT '2014-03-03 0:0'::timestamp$$;  -- One month in the past

Use it as condition in your index:

CREATE INDEX idx_pings_monitor_created_response_time
ON     pings (monitor_id, created_at DESC, response_time)
WHERE  response_time IS NOT NULL  -- maybe
AND   created_at > f_ping_event_horizon();

And your query looks like this now:

SELECT response_time
FROM   pings
WHERE  monitor_id = 3
AND    response_time IS NOT NULL
AND    created_at > '2014-03-03 20:23:07.254281'
AND    created_at > f_ping_event_horizon();

Aside: I trimmed some noise.

The last condition looks logically redundant: only include it if Postgres does not understand that it can use the index without it, which may well be the case. For this to work, the actual timestamp in the query must be later than the one in the function. But that's obviously the case according to your comments.

This way we cut away all the irrelevant rows and keep the index much smaller. The effect degrades slowly over time as new rows accumulate. Refit the event horizon and recreate the index from time to time to shed the added weight; a weekly cron job would do, for example.

When updating (recreating) the function, you need to recreate all indexes that use the function in any way. Best in the same transaction. Because the IMMUTABLE declaration for the helper function is a bit of a false promise. But Postgres only accepts immutable functions in index definitions. So we have to lie about it. More about that:
Does PostgreSQL support "accent insensitive" collations?
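The periodic refit could look like the following, all in one transaction so concurrent queries never see the function and the index out of sync (the timestamp is a placeholder for "one month ago at the time the job runs"):

```sql
BEGIN;

-- Drop the index first; its stored predicate was built with the
-- old horizon and becomes stale once the function changes.
DROP INDEX IF EXISTS idx_pings_monitor_created_response_time;

-- CREATE OR REPLACE keeps the function's OID, so dependent objects
-- created afterwards keep pointing at it.
CREATE OR REPLACE FUNCTION f_ping_event_horizon()
  RETURNS timestamp LANGUAGE sql IMMUTABLE COST 1 AS
$$SELECT '2014-04-01 0:0'::timestamp$$;  -- new horizon: one month back

CREATE INDEX idx_pings_monitor_created_response_time
ON     pings (monitor_id, created_at DESC, response_time)
WHERE  response_time IS NOT NULL
AND    created_at > f_ping_event_horizon();

COMMIT;
```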

Why the function at all? This way, all the queries using the index can remain unchanged.

With all of these changes the query should be faster by orders of magnitude now. A single, continuous index-only scan is all that's needed. Can you confirm that?

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow