Better way to select from nested tables in PostgreSQL

https://dba.stackexchange.com/questions/94773

16-12-2020
|

Question

I have the following schema with millions of rows in the match table and each match have two sides (away and home). I want to create a view which shows the most significant data regarding the matches replacing the ids with names. Names can change, this is the reason of the separate tables for coaches and teams.

schema

I came up with two approaches and I am not satisfied with either of the two.

CREATE VIEW "fumbbl"."matches" AS
  SELECT "fmid", "time",
    (SELECT "name" FROM "fumbbl"."division"
      WHERE "fdid" = m."fdid") "division",
    (SELECT "coachbracket" FROM "fumbbl"."side"
      WHERE "fmid" = m."fmid" AND "away" IS FALSE) "hbracket",
    (SELECT "name" FROM "fumbbl"."coach"
      WHERE "fcid" = (SELECT "fcid" FROM "fumbbl"."side"
        WHERE "fmid" = m."fmid" AND "away" IS FALSE)) "hcoach",
    (SELECT "name" FROM "fumbbl"."team"
      WHERE "ftid" = (SELECT "ftid" FROM "fumbbl"."side"
        WHERE "fmid" = m."fmid" AND "away" IS FALSE)) "hteam",
    (SELECT "kilotv" FROM "fumbbl"."side"
      WHERE "fmid" = m."fmid" AND "away" IS FALSE) "hktv",
    (SELECT "race" FROM "fumbbl"."team"
      WHERE "ftid" = (SELECT "ftid" FROM "fumbbl"."side"
        WHERE "fmid" = m."fmid" AND "away" IS FALSE)) "hrace",
    (SELECT "score" FROM "fumbbl"."side"
      WHERE "fmid" = m."fmid" AND "away" IS FALSE) "hscore",
    (SELECT "score" FROM "fumbbl"."side"
      WHERE "fmid" = m."fmid" AND "away" IS TRUE) "ascore",
    (SELECT "race" FROM "fumbbl"."team"
      WHERE "ftid" = (SELECT "ftid" FROM "fumbbl"."side"
        WHERE "fmid" = m."fmid" AND "away" IS TRUE)) "arace",
    (SELECT "kilotv" FROM "fumbbl"."side"
      WHERE "fmid" = m."fmid" AND "away" IS TRUE) "aktv",
    (SELECT "name" FROM "fumbbl"."team"
      WHERE "ftid" = (SELECT "ftid" FROM "fumbbl"."side"
        WHERE "fmid" = m."fmid" AND "away" IS TRUE)) "ateam",
    (SELECT "name" FROM "fumbbl"."coach"
      WHERE "fcid" = (SELECT "fcid" FROM "fumbbl"."side"
        WHERE "fmid" = m."fmid" AND "away" IS TRUE)) "acoach",
    (SELECT "coachbracket" FROM "fumbbl"."side"
      WHERE "fmid" = m."fmid" AND "away" IS TRUE) "abracket"
  FROM "fumbbl"."match" m ORDER BY "fmid" DESC;

The problem with this one is that I guess it is doing identical subqueries multiple times which makes it ineffective. However it performs better than the second one on limited number of rows. It is slower though when making the selection on all data or when doing an ascending select on fmid.

First row of explain:

"Index Scan using match_fmid_desc on match m  (cost=0.42..46098616.62 rows=289261 width=14)"

Here the index used is

CREATE INDEX "match_fmid_desc" ON "fumbbl"."match" ("fmid" DESC);

The second approach uses joins:

CREATE VIEW "fumbbl"."matches" AS
  SELECT
      m."fmid",
      m."time",
      (SELECT "name" FROM "fumbbl"."division"
         WHERE "fdid" = m."fdid") "division",
      "hbracket",
      (SELECT "name" FROM "fumbbl"."coach"
         WHERE "fcid" = hst."fcid") "hcoach",
      "hteam",
      "hktv",
      "hrace",
      "hscore",
      "ascore",
      "arace",
      "aktv",
      "ateam",
      (SELECT "name" FROM "fumbbl"."coach"
         WHERE "fcid" = ast."fcid") "acoach",
      "abracket"
    FROM (
      SELECT *
      FROM "fumbbl"."match"
      ) m
    JOIN (
      (SELECT
        "fmid", "away", "fcid", "ftid",
        "coachbracket" AS "hbracket",
        "kilotv" AS "hktv",
        "score" AS "hscore"
      FROM "fumbbl"."side") hside
      JOIN (
        SELECT
          "ftid",
          "name" AS "hteam",
          "race" AS "hrace"
        FROM "fumbbl"."team") ht
        ON (hside."ftid" = ht."ftid")
      ) hst
    ON (m."fmid" = hst."fmid" AND hst."away" IS FALSE)
    JOIN (
      (SELECT
        "fmid", "away", "fcid", "ftid",
        "coachbracket" AS "abracket",
        "kilotv" AS "aktv",
        "score" AS "ascore"
      FROM "fumbbl"."side") aside
      JOIN (
        SELECT
          "ftid",
          "name" AS "ateam",
          "race" AS "arace"
        FROM "fumbbl"."team") at
        ON (aside."ftid" = at."ftid")
      ) ast
    ON (m."fmid" = ast."fmid" AND ast."away" IS TRUE)
  ORDER BY "fmid" DESC;

First row of explain:

"Sort  (cost=7238489.68..7239207.00 rows=286927 width=84)"

My problem with this one is that it joins the whole tables before providing even a single row. Performs better on the whole data though.

I am quite new to PostgreSQL so maybe I missed a method to perform such kind of query in an effective way. I would be glad to have an iterative method which gets the data of the two sides only once by each iteration.

EDIT: After Erwin's answer, which performed nicely by explain value I still wanted to try another way. I made a function which does the most of the job by getting the side fields of a match. This function is used by the view.

CREATE OR REPLACE FUNCTION fumbbl.get_sides_data(
  in "fmid" integer,
  out "hcoach" character varying,
  out "hcoachbracket" "fumbbl"."bracket",
  out "hteam" character varying,
  out "hrace" character varying,
  out "hkilotv" smallint,
  out "hscore" smallint,
  out "acoach" character varying,
  out "acoachbracket" "fumbbl"."bracket",
  out "ateam" character varying,
  out "arace" character varying,
  out "akilotv" smallint,
  out "ascore" smallint
) AS
$$
DECLARE
  sfcid integer;
  sftid integer;
BEGIN
  SELECT s."fcid", s."coachbracket", s."kilotv", s."ftid", s."score"
    from fumbbl.side s
    WHERE (s."fmid" = $1 AND s."away" IS FALSE) INTO sfcid, $3, $6, sftid, $7;
  SELECT c."name" from fumbbl.coach c WHERE c."fcid" = sfcid INTO $2;
  SELECT t."name", t."race" from fumbbl.team t WHERE t."ftid" = sftid INTO $4, $5;

  SELECT s."fcid", s."coachbracket", s."kilotv", s."ftid", s."score"
    from fumbbl.side s
    WHERE (s."fmid" = $1 AND s."away" IS TRUE) INTO sfcid, $9, $12, sftid, $13;
  SELECT c."name" from fumbbl.coach c WHERE c."fcid" = sfcid INTO $8;
  SELECT t."name", t."race" from fumbbl.team t WHERE t."ftid" = sftid INTO $10, $11;
END;
$$ LANGUAGE plpgsql;

CREATE OR REPLACE VIEW fumbbl.matches4 AS
SELECT
m.fmid,
m.time,
(SELECT "name" FROM "fumbbl"."division" WHERE "fdid" = m."fdid") "division",
s."hcoach",
s."hcoachbracket",
s."hteam",
s."hrace",
s."hkilotv",
s."hscore",
s."acoach",
s."acoachbracket",
s."ateam",
s."arace",
s."akilotv",
s."ascore"
FROM fumbbl.match m, fumbbl.get_sides_data(m.fmid) s
ORDER  BY m.fmid DESC;

Conclusions:

I realized that explain can not measure the effectiveness of the view which calls the function. So I ended up comparing the options (including Erwin's) with a timer. Not surprisingly Option 1 (many selects) and Option 3 (function caller) performs better in selects with fever rows. Option 3 seemes to dominate on 1-10 rows in my database. However, for hundreds of thousands of rows, JOIN is the way. By using JOIN, you have to wait the whole operation before getting a single row. Surprisingly, Erwin's way seems to be just as fast as Option 2 (nested joins) which makes me feel that PG does some kind of optimization in the background. Still, Erwin's code is nice and more maintainable.

Solution

In Postgres 9.4 you can simplify with the aggregate FILTER clause:

CREATE VIEW fumbbl.matches AS
SELECT m.fmid, m.time, d.name AS division
     , min(s.coachbracket) FILTER (WHERE NOT s.away) AS hbracket
     , min(t.name)         FILTER (WHERE NOT s.away) AS hteam
     , min(c.name)         FILTER (WHERE NOT s.away) AS hcoach
     , min(s.score)        FILTER (WHERE NOT s.away) AS hscore
     , min(t.race)         FILTER (WHERE NOT s.away) AS hrace
     , min(s.kilotv)       FILTER (WHERE NOT s.away) AS hktv

     , min(s.coachbracket) FILTER (WHERE s.away) AS abracket
     , min(t.name)         FILTER (WHERE s.away) AS ateam
     , min(c.name)         FILTER (WHERE s.away) AS acoach
     , min(s.score)        FILTER (WHERE s.away) AS ascore
     , min(t.race)         FILTER (WHERE s.away) AS arace
     , min(s.kilotv)       FILTER (WHERE s.away) AS aktv
FROM   fumbbl.match    m
JOIN   fumbbl.division d USING (fdid)
JOIN   fumbbl.side     s USING (fmid)
JOIN   fumbbl.team     t USING (ftid)
JOIN   fumbbl.coach    c USING (fcid)
GROUP  BY m.fmid, d.fdid  -- PK columns
ORDER  BY m.fmid DESC;

Joins to each table only once. But it needs an aggregate step to fold home and away side into one row.

Details for the new FILTER clause:

Return counts for multiple ranges in a single SELECT statement

Licensed under: CC-BY-SA with attribution

Not affiliated with dba.stackexchange