Question

Let's assume I have two very simple tables

CREATE TABLE a(id integer PRIMARY KEY, 
       t timestamp default now(), 
       sensor_readings real[]);
CREATE TABLE b(id integer PRIMARY KEY, 
       t timestamp default now(), 
       sensor_readings real[]);

with some data in them

INSERT INTO a(id) SELECT generate_series(    1,   100);
INSERT INTO b(id) SELECT generate_series(10001, 10100);

In reality, table a might have about 100_000 rows and table b about 50_000. In practice, the id sequences might also have gaps (on the order of a few percent). Thus, the Cartesian product a x b has a cardinality in the billions.

I want to take a random sample of 1_000 sorted pairs (a.id, b.id). I could use a query like the following:

SELECT *
FROM  (
    SELECT a.id AS a_id, b.id AS b_id
    FROM   a CROSS JOIN b
    ORDER  BY random()
    LIMIT  1000
    ) AS s
ORDER BY a_id, b_id;

... but it would become extremely inefficient as soon as the number of rows in a or b grows, because the CROSS JOIN grows as the product of the two row counts.

Is there any way of doing something equivalent to this in an optimal manner? That is, is there a practical way of getting a random sample of rows from the a x b relation without actually having to materialize it?

NOTE: a.id and b.id values may be repeated across the sample; the pair (a.id, b.id), however, cannot.

If I were trying to program this in an imperative language, I'd most probably use a loop and do something like the following pseudo-code (and then have it checked by a statistician, to make sure all pairs really have the same probability of being chosen):

start with a result set equal to {} (empty set)
while size of result set < 1000
    Pick the id value from a random row from table a -> rand_id_a
    Pick the id value from a random row from table b -> rand_id_b
    If (rand_id_a, rand_id_b) not in result set
        append (rand_id_a, rand_id_b) to result set
    end if
end while
sort result set and return it

Is there a way to achieve an equivalent result without resorting to loops? If not, is there an efficient way to do it using PL/pgSQL (or any other language)?
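For reference, the pseudo-code above translates fairly directly to PL/pgSQL. This is only a sketch: the function name and the temp table are made up, and the per-iteration ORDER BY random() full scans make it far slower than any set-based approach.

```sql
CREATE OR REPLACE FUNCTION random_pairs(n int)
RETURNS TABLE (a_id int, b_id int)
LANGUAGE plpgsql AS
$func$
BEGIN
   -- pair uniqueness is enforced by the primary key
   CREATE TEMP TABLE result_set (
      a_id int,
      b_id int,
      PRIMARY KEY (a_id, b_id)
   ) ON COMMIT DROP;

   WHILE (SELECT count(*) FROM result_set) < n LOOP
      INSERT INTO result_set (a_id, b_id)
      SELECT ra.id, rb.id
      FROM  (SELECT id FROM a ORDER BY random() LIMIT 1) ra
          , (SELECT id FROM b ORDER BY random() LIMIT 1) rb
      ON CONFLICT DO NOTHING;      -- "if not in result set"
   END LOOP;

   RETURN QUERY
   SELECT r.a_id, r.b_id
   FROM   result_set r
   ORDER  BY 1, 2;                 -- "sort result set and return it"
END
$func$;
```

Called as SELECT * FROM random_pairs(1000); within a transaction (the temp table is dropped at commit).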


Solution

The best solution depends on the exact definition of your setup. For the example setup it's trivial:

  • Serial integer columns without gaps.

SELECT DISTINCT
           1 + trunc(random() * 100)::int AS a_id
     , 10001 + trunc(random() * 100)::int AS b_id
FROM   generate_series(1, 1100) g  -- enough excess to make up for possible dupes
LIMIT  1000;  -- only take 1000

The only interesting question: how to fold dupes efficiently. The solution: let Postgres decide. Simply use DISTINCT.
We don't even need to involve the tables at all. Super fast.

Note that random() generates (per documentation):

random value in the range 0.0 <= x < 1.0

Hence 1 + trunc(random() * 100)::int to cover ID numbers between 1 and 100 exactly.

Actual setup?

You need to be more specific about your actual setup. Let's assume there is at least a payload column in each of your tables, not just ID columns.

CREATE TABLE a(a_id integer PRIMARY KEY, a text);
CREATE TABLE b(b_id integer PRIMARY KEY, b text);

INSERT INTO a(a_id, a)
SELECT g, 't' || g FROM generate_series(    1,   100) g;
INSERT INTO b(b_id, b)
SELECT g, 't' || g FROM generate_series(10001, 10100) g;

Query:

SELECT a.a_id, a.a, b.b_id, b.b
FROM  (
    SELECT DISTINCT
               1 + trunc(random() * 100)::int AS a_id  -- cover *whole* key space
         , 10001 + trunc(random() * 100)::int AS b_id  -- maybe add reserve for new rows
    FROM   generate_series(1, 1100) g
    LIMIT  1000
    ) ra
JOIN   a USING (a_id)
JOIN   b USING (b_id);

Truly random, very fast and almost independent of the actual table size.

All you need is indexes on a(a_id) and b(b_id). Or possibly multicolumn indexes to allow index-only scans.
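For the two-column tables in this section, such multicolumn indexes could be sketched as follows (the index names are made up):

```sql
CREATE INDEX a_id_a_idx ON a (a_id, a);
CREATE INDEX b_id_b_idx ON b (b_id, b);
```

With these (and recently vacuumed tables), the joins above can be answered by index-only scans, without touching the heap at all.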


The solution also works with a few gaps from skipped nextval() calls. As long as gaps make up only a small fraction of the key space, it's still very cheap to generate enough combinations to cover the losses from gaps (much cheaper than working with a Cartesian product of big tables, or sorting whole big tables with ORDER BY random(), anyway). Just be sure to generate enough combinations.

SELECT a.a_id, a.a, b.b_id, b.b
FROM  (
    SELECT DISTINCT
               1 + trunc(random() * 100)::int AS a_id
         , 10001 + trunc(random() * 100)::int AS b_id
    FROM   generate_series(1, 1100) g  -- enough to cover dupes *and* gaps
    ) ra
JOIN   a USING (a_id)
JOIN   b USING (b_id)
LIMIT  1000;  -- LIMIT moves to outer query to cover gaps

With more than a few gaps, start with a number of combinations that's enough 95 % of the time, and add a recursive step to fetch more rows if you fall short. There is a recipe for this (for a single table), with more explanation and variations, in related answers on selecting random rows.
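Such a recursive top-up could be sketched like this for the schema above, following that single-table recipe (batch sizes are illustrative and should be tuned to the expected gap ratio):

```sql
WITH RECURSIVE pairs AS (
   (  -- first batch, sized to be enough most of the time
   SELECT a.a_id, b.b_id
   FROM  (SELECT     1 + trunc(random() * 100)::int AS a_id
               , 10001 + trunc(random() * 100)::int AS b_id
          FROM   generate_series(1, 1100)) r
   JOIN   a USING (a_id)
   JOIN   b USING (b_id)
   )
   UNION                           -- folds dupes across batches
   (  -- top-up batch, only evaluated if we fall short
   SELECT a.a_id, b.b_id
   FROM  (SELECT     1 + trunc(random() * 100)::int AS a_id
               , 10001 + trunc(random() * 100)::int AS b_id
          FROM   pairs             -- only used as a row generator
          LIMIT  999) r
   JOIN   a USING (a_id)
   JOIN   b USING (b_id)
   )
   )
SELECT *
FROM   pairs
LIMIT  1000;  -- stops the recursion as soon as we have enough
```

The outer LIMIT halts evaluation lazily, so the recursive term only runs when dupes and gaps leave the first batch short of 1000 rows.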

OTHER TIPS

I want to take a random sample of 1000 sorted pairs (a.id, b.id).

It always depends on what random means, but if you're specifying the number of rows you want, then you likely want the extension tsm_system_rows.

The tsm_system_rows module provides the table sampling method SYSTEM_ROWS, which can be used in the TABLESAMPLE clause of a SELECT command.

This table sampling method accepts a single integer argument that is the maximum number of rows to read. The resulting sample will always contain exactly that many rows, unless the table does not contain enough rows, in which case the whole table is selected. Like the built-in SYSTEM sampling method, SYSTEM_ROWS performs block-level sampling, so that the sample is not completely random but may be subject to clustering effects, especially if only a small number of rows are requested.

First install the extension

CREATE EXTENSION tsm_system_rows;

Then your query,

SELECT *
FROM   a
CROSS  JOIN b TABLESAMPLE SYSTEM_ROWS(1000);

Note that TABLESAMPLE attaches to the table it follows (here b), not to the join as a whole: this reads at most 1000 rows from b and joins all of a against them, so to sample the joined result itself you need to materialize it first, as in the last section below. The important thing about SYSTEM_ROWS is that it always provides exactly the requested number of rows (when the table has them), which is more than we can say for random() <= 0.10, or for TABLESAMPLE BERNOULLI.

If that's not good enough

If you truly need random and can't accept the clustering drawback, then I would fall back to

SELECT *
FROM a CROSS JOIN b
ORDER BY random()
LIMIT 1000;

If you need to trim out duplicates

The only sane way to trim out duplicates (if a.id and b.id are not UNIQUE) and keep the result set random is to do it beforehand. That can get nasty, because TABLESAMPLE doesn't yet work on virtual tables, so you'll have to create a temporary table (which may still fit in memory). Shy of that, you can use the other method, which is also slow and ugly, but at least it doesn't have to write the intermediate result to a temporary table first:

SELECT *
FROM (
  SELECT DISTINCT ON(a.id, b.id) a.id, b.id
  FROM a
  CROSS JOIN b
) AS t
ORDER BY random()
FETCH FIRST 1000 ROWS ONLY;
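The temporary-table variant mentioned above could be sketched like this (the table name ab is made up; SYSTEM_ROWS requires the tsm_system_rows extension installed earlier):

```sql
CREATE TEMP TABLE ab AS
SELECT DISTINCT a.id AS a_id, b.id AS b_id
FROM   a CROSS JOIN b;

SELECT *
FROM   ab TABLESAMPLE SYSTEM_ROWS(1000)
ORDER  BY a_id, b_id;
```

This still materializes the full product once, but repeated samples from it are then cheap.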
Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange