Optimizing SQL “Where” clause for queries with subqueries

https://stackoverflow.com/questions/7924093

15-02-2021

Question

Let's say I have the following hypothetical data structure:

create table "country"
(
  country_id integer,  
  country_name varchar(50),
  continent varchar(50),
  constraint country_pkey primary key (country_id)
);

create table "person"
(
  person_id integer,
  person_name varchar(100),
  country_id integer,
  constraint person_pkey primary key (person_id)
);

create table "event"
(
  event_id integer,
  event_desc varchar(100),
  country_id integer,
  constraint event_pkey primary key (event_id)
);

I want to query the number of rows of people and events per country. I decided to use a subquery.

select c.country_name, sum(sub1.person_count) as person_count, sum(sub2.event_count) as event_count
from
  "country" c
  left join (select country_id, count(*) as person_count from "person" group by country_id) sub1
    on (c.country_id=sub1.country_id)
  left join (select country_id, count(*) as event_count from "event" group by country_id) sub2
    on (c.country_id=sub2.country_id)
group by c.country_name

I know this can be done with select statements in the field list, but the advantage of subqueries is flexibility: I can easily change the SQL to summarize by another field. For example, to display the counts by continent, I only need to replace the field "c.country_name" with "c.continent".
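For comparison, the select-list form mentioned above might look like the following minimal sketch. The aliases p and e are illustrative and not part of the original schema; each count is a correlated scalar subquery evaluated per row of "country".

```sql
-- Sketch only: per-country counts written as scalar subqueries in the select list.
-- p and e are illustrative aliases.
select c.country_name,
  (select count(*) from "person" p where p.country_id = c.country_id) as person_count,
  (select count(*) from "event" e where e.country_id = c.country_id) as event_count
from "country" c;
```

Summarizing this form by continent would mean wrapping it in another query, which is the inflexibility described above.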

My problem is regarding filtering. If we add a where clause like so:

select c.country_name, 
  sum(sub1.person_count) as person_count, 
  sum(sub2.event_count) as event_count
from
  "country" c
  left join (select country_id, count(*) as person_count from "person" group by country_id) sub1
    on (c.country_id=sub1.country_id)
  left join (select country_id, count(*) as event_count from "event" group by country_id) sub2
    on (c.country_id=sub2.country_id)
where c.country_name='UNITED STATES'
group by c.country_name

The subqueries still seem to execute the counting for all countries. Assume that the person and event tables are huge and that I already have indexes on country_id on all tables. It's really slow. Shouldn't the database only execute the subqueries for the country that was filtered? Do I have to repeat the country filter in each subquery (this is very tedious, and the code is not easily modifiable)? I am using both PostgreSQL 8.3 and 9.0, by the way, but I guess the same happens in other databases.

Solution

Shouldn't the database only execute the subqueries for the country that was filtered?

No. A query like yours is evaluated as if the first step were to build a working table from all of the table constructors in the FROM clause. The WHERE clause is applied after that.

Imagine how you'd do this if sub1 and sub2 were both base tables instead of subselects. They'd both have two columns, and they'd both have one row for each country_id. And if you wanted to JOIN all the rows, you'd write it like this.

from
  "country" c
  left join sub1 on (c.country_id=sub1.country_id)
  left join sub2 on (c.country_id=sub2.country_id)

But if you wanted to JOIN on a single row, you'd write something equivalent to this.

from
  "country" c
  left join (select * from sub1 where country_id = ?) sub1
    on (c.country_id=sub1.country_id)
  left join (select * from sub2 where country_id = ?) sub2
    on (c.country_id=sub2.country_id)
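Applied to the original query, the same idea could be sketched as follows. This is one possible way to repeat the filter inside each subselect, not the only one; the aliases p and e are illustrative, and whether the planner actually uses the country_id indexes depends on the data and its statistics.

```sql
-- Sketch: the country filter is repeated inside each subselect so that
-- only the matching country's rows are counted.
select c.country_name,
  sum(sub1.person_count) as person_count,
  sum(sub2.event_count) as event_count
from
  "country" c
  left join (
    select p.country_id, count(*) as person_count
    from "person" p
    where p.country_id in (select country_id from "country"
                           where country_name = 'UNITED STATES')
    group by p.country_id
  ) sub1 on (c.country_id = sub1.country_id)
  left join (
    select e.country_id, count(*) as event_count
    from "event" e
    where e.country_id in (select country_id from "country"
                           where country_name = 'UNITED STATES')
    group by e.country_id
  ) sub2 on (c.country_id = sub2.country_id)
where c.country_name = 'UNITED STATES'
group by c.country_name;
```

The outer where clause is still needed; without it, every other country would appear in the result with null counts.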

Joe Celko, who helped develop early SQL standards, has often written on Usenet about how SQL's order of evaluation appears to work.

OTHER TIPS

Can you filter and group rows by country_id instead of country_name? I suppose you do not have an index on the name.
It is to be expected that the subqueries do not use any index, because they scan the whole table. If you want to reduce the scans, you should filter the data.
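A rough sketch of both suggestions, under stated assumptions: the index name country_name_idx is hypothetical, and the id value 1 is a placeholder rather than a value from the question. EXPLAIN ANALYZE can be used to check whether a filtered subquery actually uses the index on country_id.

```sql
-- Hypothetical index name; supports lookups of a country by name.
create index country_name_idx on "country" (country_name);

-- Inspect the plan of one filtered subquery.
-- The id 1 is a placeholder, not a value from the question.
explain analyze
select country_id, count(*) as person_count
from "person"
where country_id = 1
group by country_id;
```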

Licensed under: CC-BY-SA with attribution
