Question

I need the lowest value for runnerId.

This query:

SELECT "runnerId" FROM betlog WHERE "marketId" = '107416794' ;

takes 80 ms (1968 result rows).

This:

SELECT min("runnerId") FROM betlog WHERE "marketId" = '107416794' ;

takes 1600 ms.

Is there a faster way to find the minimum, or should I calculate it in my Java program?

"Result  (cost=100.88..100.89 rows=1 width=0)"
"  InitPlan 1 (returns $0)"
"    ->  Limit  (cost=0.00..100.88 rows=1 width=9)"
"          ->  Index Scan using runneridindex on betlog  (cost=0.00..410066.33 rows=4065 width=9)"
"                Index Cond: ("runnerId" IS NOT NULL)"
"                Filter: ("marketId" = 107416794::bigint)"

CREATE INDEX marketidindex
  ON betlog
  USING btree
  ("marketId" COLLATE pg_catalog."default");

Another idea:

SELECT "runnerId" FROM betlog WHERE "marketId" = '107416794' ORDER BY "runnerId" LIMIT 1;  -- >1600 ms
SELECT "runnerId" FROM betlog WHERE "marketId" = '107416794' ORDER BY "runnerId";          -- >>100 ms

How can a LIMIT slow the query down?


Solution

What you need is a multi-column index:

CREATE INDEX betlog_mult_idx ON betlog ("marketId", "runnerId");

If interested, you'll find in-depth information about multi-column indexes in PostgreSQL, links and benchmarks under this related question on dba.SE.

How did I figure?
In a multi-column index, rows are ordered (and thereby clustered) by the first index column ("marketId"), and within each cluster they are in turn ordered by the second column ("runnerId") - so the first row of the matching cluster holds min("runnerId"). This makes the index scan extremely fast.
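With the two-column index in place, the plan should collapse to a single index lookup. A sketch of what EXPLAIN might show (the plan text below is illustrative, not an actual run; "Index Only Scan" requires PostgreSQL 9.2 or later, older versions would show a plain Index Scan):

```sql
CREATE INDEX betlog_mult_idx ON betlog ("marketId", "runnerId");

EXPLAIN SELECT min("runnerId") FROM betlog WHERE "marketId" = '107416794';
-- Expect something along these lines:
--   Result
--     InitPlan 1 (returns $0)
--       ->  Limit
--             ->  Index Only Scan using betlog_mult_idx on betlog
--                   Index Cond: (("marketId" = 107416794) AND ("runnerId" IS NOT NULL))
```

Note the Index Cond now filters on "marketId" inside the index itself, instead of the Filter step seen in the original plan.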

Concerning the paradoxical effect of LIMIT slowing down a query: the Postgres query planner has a weakness there. The common workaround is to use a CTE (not necessary in this case). Find more information under this recent, closely related question:
PostgreSQL query taking too long
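If adding the index is not an option, the CTE workaround mentioned above can look like this (a sketch; in PostgreSQL 12 and later, write WITH m AS MATERIALIZED (...) to preserve the barrier, since plain CTEs are no longer materialized by default there):

```sql
-- The CTE acts as an optimization barrier: the planner first fetches
-- all rows for the given "marketId", then takes the minimum from that set.
WITH m AS (
    SELECT "runnerId"
    FROM   betlog
    WHERE  "marketId" = '107416794'
)
SELECT min("runnerId") FROM m;
```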

Other tips

Without a usable index, PostgreSQL will execute the min() with a sequential scan of the entire table. You can optimize the query with the following approach:

SELECT col FROM sometable ORDER BY col ASC LIMIT 1;

When you had an index on ("runnerId") (or at least one with "runnerId" as the leading column) but no index on ("marketId", "runnerId"), the planner compared two alternatives: passing all rows with a matching "marketId" via the index on that column and picking the minimum "runnerId" from that set, versus scanning the index on "runnerId" and stopping at the first row with a matching "marketId". Based on the available statistics, and on the assumption that "marketId" values are randomly distributed among the entries of the "runnerId" index, it estimated a lower cost for the latter approach.

It also estimated the cost of scanning the whole table and picking the minimum from matching rows, as well as probably a number of other alternatives. It does not always use one particular type of plan; it compares the costs of all the alternatives.

The problem is that the assumption that values will be randomly distributed in the range is not necessarily true (as in this example), leading to a scan of a high percentage of the range to find the rows lurking at the end. For some values of "marketId", where the chosen value is available near the beginning of the "runnerId" index, this plan should be very fast.

There has been discussion in the PostgreSQL developer community of how we might bias against plans which are "risky" in terms of running long if the data distribution is not what was assumed, and there has been work on tracking multi-column statistics so that correlated values don't run into such problems. Expect improvements in this area in the next few releases. Until then, Erwin's suggestions are on target for how to work around the issue.

Basically it comes down to making a more attractive plan available or introducing an optimization barrier. In this case you can provide a more attractive option by adding the index on ("marketId", "runnerId") -- which allows a very direct way to get straight to the answer. The planner assigns a very low cost to that alternative, causing it to be chosen. If you preferred not to add the index, you could force an optimization barrier by doing something like this:

SELECT min("runnerId")
  FROM (SELECT "runnerId" FROM betlog
          WHERE "marketId" = '107416794'
          OFFSET 0) x;

When there is an OFFSET clause (even for an offset of zero), it forces the subquery to be planned separately and its results fed to the outer query. I would expect this to run in 80 ms rather than the 1600 ms you get without the optimization barrier. Of course, if you can add the index, the query should run in under 1 ms when the data is cached.

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow