Pergunta

My database consists of apps and their reviews (schema below). I'm trying to answer the following question:

Given a series of dates from the earliest reviews.review_date to the latest reviews.review_date (incrementing by a day), for each date, D, which apps had the most reviews if the app's earliest review was on or later than D?

This is the query that I've come up with to try and answer that question:

select
  review_windows.review_window_start,
  id,
  slug,
  review_total,
  earliest_review
from
  (
    select
      date_trunc('day', review_windows.review_windows) :: date as review_window_start
    from
      generate_series(
        (
          SELECT
            min(reviews.review_date)
          FROM
            reviews
        ),
        (
          SELECT
            max(reviews.review_date)
          FROM
            reviews
        ),
        '1 day'
      ) review_windows
    order by
      1 desc
  ) review_windows
  left join lateral (
    SELECT
      apps.id,
      apps.slug,
      count(reviews.*) as review_total,
      min(reviews.review_date) as earliest_review
    FROM
      reviews
      INNER JOIN apps ON apps.id = reviews.app_id
    where
      reviews.review_date >= review_windows.review_window_start
    group by
      1,
      2
    having
      min(reviews.review_date) >= review_windows.review_window_start
    order by
      3 desc,
      4 desc
    limit
      2
  ) apps_most_reviews on true;

It is extremely slow and I'm not sure why. If I want any kinds of results I use week instead of day in the generate_series call and even then that might take a minute or even longer.

Where should I start when debugging a performance issue like this?

Visualized query plan here

There are ~5K rows in apps and ~400K rows in reviews so it's a mystery to me why this is taking so long.

Running the individual subquery that is run for each entry in the lateral join given a single date only takes 161 ms (below) and the subquery for generate_series only takes 4 ms. I'm clearly doing something very wrong. Any help would be much appreciated!

Individual subquery with an explicit date

SELECT
  apps.id,
  apps.slug,
  count(reviews.*) as review_total,
  min(reviews.review_date) as earliest_review
FROM
  reviews
  INNER JOIN apps ON apps.id = reviews.app_id
where
  reviews.review_date >= '2018-04-17'::date
group by
  1,
  2
having
  min(reviews.review_date) >= '2018-04-17'::date
order by
  3 desc,
  4 desc
limit
  2

Tables

apps

Schema

|   | column_name | data_type    | is_nullable | foreign_key |
|---|-------------|--------------|-------------|-------------|
| 1 | id          | int4         | NO          |             |
| 2 | name        | varchar(255) | NO          |             |
| 3 | slug        | varchar(255) | NO          |             |

Indexes

| index_name      | index_algorithm | is_unique | column_name |
|-----------------|-----------------|-----------|-------------|
| apps_slug_index | BTREE           | t         | slug        |
| apps_pkey       | BTREE           | t         | id          |

reviews

Schema

|   | column_name   | data_type    | is_nullable | foreign_key     |
|---|---------------|--------------|-------------|-----------------|
| 1 | id            | int4         | NO          |                 |
| 2 | rating        | int4         | NO          |                 |
| 3 | review_date   | date         | NO          |                 |
| 4 | reviewer_name | varchar(255) | NO          |                 |
| 5 | review_body   | text         | NO          |                 |
| 6 | app_id        | int4         | NO          | public.apps(id) |

Indexes

| index_name                  | index_algorithm | is_unique | column_name   |
|-----------------------------|-----------------|-----------|---------------|
| reviews_reviewer_name_index | BTREE           | f         | reviewer_name |
| reviews_review_date_index   | BTREE           | f         | review_date   |
| reviews_pkey                | BTREE           | t         | id            |
| reviews_app_id_index        | BTREE           | f         | app_id        |
Foi útil?

Solução

The main reason for the slowness is that you aggregate over the big table from scratch for every iteration of the lateral sibquery. Compute earliest review & current total count per app in a CTE once and base the lateral subquery on it. I discussed that and some other optizations under your predating related question on SO:

One difference: In this question you also return an additional attribute of the app (slug), so we need to join to table apps after all. Literally: join to apps after aggregating reviews. That's cheaper.

Another difference: earliest_review as additioal tiebreaker. That's just a gradual improvement, though, as there can stiill be 0-n winners per review_window_start. Instead of picking the (arbitrary) top two ( limit 2), select the first one, with ties. (Wording already hints at the new technique in Postgres 13; see below.)

WITH cte AS MATERIALIZED (
   SELECT a.id, a.slug, r.earliest_review, r.review_total
   FROM   apps a
   JOIN    (
      SELECT app_id, min(review_date) AS earliest_review, count(*)::int AS review_total
      FROM   reviews
      GROUP  BY 1
      ) r ON r.app_id = a.id
   )
SELECT *
FROM  (
   SELECT generate_series(min(review_date)
                        , max(review_date)
                        , '1 day')::date
   FROM   reviews
   ) d(review_window_start)
LEFT  JOIN LATERAL (
   SELECT app_id, slug, review_total, earliest_review
   FROM  (
      SELECT app_id, slug, review_total, earliest_review
            ,  rank() OVER (ORDER BY review_total DESC, earliest_review DESC) AS rnk
      FROM   cte c
      WHERE  c.earliest_review >= d.review_window_start
      ) sub
   WHERE  rnk = 1
   ) a ON true;

Like in the related answer on SO, it will be cheaper yet in Postgres 13 using WITH TIES. See:

...
LEFT  JOIN LATERAL (
   SELECT app_id, slug, review_total, earliest_review
   FROM   cte c
   WHERE  earliest_review >= d.review_window_start
   ORDER  BY review_total DESC, earliest_review DESC
   FETCH  FIRST 1 ROWS WITH TIES  -- new!
   ) a ON true;
Licenciado em: CC-BY-SA com atribuição
Não afiliado a dba.stackexchange
scroll top