PostgreSQL: How to figure out missing numbers in a column using generate_series()?

https://stackoverflow.com/questions/12444142

02-07-2021

Question

SELECT commandid 
FROM results 
WHERE NOT EXISTS (
    SELECT * 
    FROM generate_series(0,119999) 
    WHERE generate_series = results.commandid 
    );

The results table has a commandid column of type int, but various tests failed and were never added to the table. I would like a query that returns the list of commandid values that are not found in results. I thought the query above would do that. However, it does not work even if I use a range outside the expected range of commandid (such as negative numbers).

Solution

Given sample data:

create table results ( commandid integer primary key);
insert into results (commandid) select * from generate_series(1,1000);
delete from results where random() < 0.20;

This works:

SELECT s.i AS missing_cmd
FROM generate_series(0,1000) s(i)
WHERE NOT EXISTS (SELECT 1 FROM results WHERE commandid = s.i);

as does this alternative formulation:

SELECT s.i AS missing_cmd
FROM generate_series(0,1000) s(i)
LEFT OUTER JOIN results ON (results.commandid = s.i) 
WHERE results.commandid IS NULL;

Both of the above appear to result in identical query plans in my tests, but you should compare with your data on your database using EXPLAIN ANALYZE to see which is best.
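
A minimal sketch of that comparison, assuming the sample results table above: prefix each query with EXPLAIN ANALYZE and compare the plan nodes and the reported execution times. Keep in mind that EXPLAIN ANALYZE actually runs the query.

```sql
-- Plan and timing for the NOT EXISTS form
EXPLAIN ANALYZE
SELECT s.i AS missing_cmd
FROM generate_series(0,1000) s(i)
WHERE NOT EXISTS (SELECT 1 FROM results WHERE commandid = s.i);

-- Plan and timing for the LEFT OUTER JOIN form
EXPLAIN ANALYZE
SELECT s.i AS missing_cmd
FROM generate_series(0,1000) s(i)
LEFT OUTER JOIN results ON (results.commandid = s.i)
WHERE results.commandid IS NULL;
```

If both outputs show the same join node and similar timings, the choice between the two forms is a matter of taste.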

Explanation

Note that instead of NOT IN I've used NOT EXISTS with a subquery in one formulation, and an ordinary OUTER JOIN in the other. It's much easier for the DB server to optimise these and it avoids the confusing issues that can arise with NULLs in NOT IN.

I initially favoured the OUTER JOIN formulation, but at least in PostgreSQL 9.1 with my test data the NOT EXISTS form optimizes to the same plan.

Both will perform better than the NOT IN formulation below when the series is large, as in your case. NOT IN used to require Pg to do a linear search of the IN list for every tuple being tested, but examination of the query plan suggests Pg may be smart enough to hash it now. The NOT EXISTS (transformed into a JOIN by the query planner) and the JOIN work better.

The NOT IN formulation is confusing in the presence of NULL commandids and can also be inefficient:

SELECT s.i AS missing_cmd
FROM generate_series(0,1000) s(i)
WHERE s.i NOT IN (SELECT commandid FROM results);

so I'd avoid it. With 1,000,000 rows the other two completed in 1.2 seconds and the NOT IN formulation ran CPU-bound until I got bored and cancelled it.
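
To make the NULL issue concrete, here is a small sketch. It does not touch the real results table (whose primary key rules out NULLs in the sample data); instead it uses an inline VALUES list, named v here purely for illustration, holding the values 1 and NULL.

```sql
-- NOT IN: returns no rows at all. For i = 2 the test becomes
-- "2 <> 1 AND 2 <> NULL", which is unknown rather than true.
SELECT s.i AS missing_cmd
FROM generate_series(1,3) s(i)
WHERE s.i NOT IN (
    SELECT v.commandid FROM (VALUES (1), (NULL)) v(commandid)
);

-- NOT EXISTS: returns 2 and 3, because the NULL row simply never matches.
SELECT s.i AS missing_cmd
FROM generate_series(1,3) s(i)
WHERE NOT EXISTS (
    SELECT 1 FROM (VALUES (1), (NULL)) v(commandid) WHERE v.commandid = s.i
);
```

A single NULL in the subquery result is enough to make the NOT IN form return nothing, which is easy to mistake for "no numbers are missing".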

OTHER TIPS

As I mentioned in the comment, you need to reverse the query in the question: select from the generated series and keep only the values that do not appear in results.

SELECT
    generate_series
FROM
    generate_series(0, 119999)
WHERE
    NOT generate_series IN (SELECT commandid FROM results);

This returns the values in the selected range that do not exist in the commandid column.

I'm not an experienced SQL guru, but I like finding other ways to solve a problem. Just today I had a similar problem: finding unused numbers in a character column. I solved it with PL/pgSQL and was curious how fast my function would be. I used @Craig Ringer's approach to generate a table with a serial column, added one million records, and then deleted every 99th record. The function takes about 3 seconds to find the missing numbers:

-- creating table
create table results (commandid character(7) primary key);
-- populating table with serial numbers formatted as characters
insert into results (commandid) select cast(num_id as character(7)) from generate_series(1,1000000) as num_id;
-- delete some records
delete from results where cast(commandid as integer) % 99 = 0;

create or replace function unused_numbers()
  returns setof integer as
$body$
declare
   i integer := 1;   -- next number we expect to find in the table
   r record;
begin
   -- loop through the table in ascending order with a synchronized counter
   for r in
      select distinct cast(commandid as integer) as num_value
      from results
      order by num_value asc
   loop
      -- every number from i up to (but not including) the current value is unused
      while i < r.num_value loop
         return next i;
         i := i + 1;
      end loop;

      -- step past the current value without ever moving the counter backwards
      i := greatest(i, r.num_value + 1);
   end loop;

   return;
end;
$body$
  language plpgsql volatile
  cost 100
  rows 1000;

select * from unused_numbers();

Maybe it will be useful to someone.

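If you're on AWS Redshift, you may have to approach the question differently, since it doesn't support generate_series. Instead of listing every missing number, you can return one row per gap, with the first missing id (gapstart) and the next id that exists (resume). You'll end up with something like this: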

select 
    startpoints.id    gapstart, 
    min(endpoints.id) resume 
from (
     select id+1 id 
     from   yourtable outer_series 
     where not exists 
         (select null 
          from   yourtable inner_series 
          where  inner_series.id = outer_series.id + 1
         )
     order by id
     ) startpoints,   

     yourtable endpoints 
where 
    endpoints.id > startpoints.id 
group by 
    startpoints.id;
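
As a worked illustration of what the gap query returns, the sketch below runs the same query shape against a hypothetical five-row stand-in for yourtable (ids 1, 2, 5, 6, 9), built with a CTE. The sample data, the explicit AS keywords, and the final ORDER BY are additions for this example only.

```sql
-- Hypothetical stand-in for yourtable with ids 1, 2, 5, 6, 9
with yourtable as (
    select 1 as id union all
    select 2 union all
    select 5 union all
    select 6 union all
    select 9
)
select
    startpoints.id    as gapstart,
    min(endpoints.id) as resume
from (
     -- id + 1 is a gap start when no row with that id exists
     select id + 1 as id
     from   yourtable outer_series
     where not exists
         (select null
          from   yourtable inner_series
          where  inner_series.id = outer_series.id + 1
         )
     ) startpoints,
     yourtable endpoints
where
    endpoints.id > startpoints.id
group by
    startpoints.id
order by
    gapstart;   -- added only to make the row order predictable
-- gapstart | resume
--        3 |      5
--        7 |      9
```

The missing numbers are those from gapstart up to resume - 1, so 3, 4, 7 and 8 here. The candidate gap starting at 10 is not reported, because no larger id exists to join against.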

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow