How do I eliminate a second seq scan over a table when deriving a new table?

https://dba.stackexchange.com/questions/164605

06-10-2020
|

سؤال

Let's say I have some sample data, 100 million rows.

CREATE TEMP TABLE foo
AS
  SELECT id, md5(id::text), trunc(random()*1e6)
  FROM generate_series(1,1e6) AS t(id);

This will generate a table like this..

 id |               md5                | trunc  
----+----------------------------------+--------
  1 | c4ca4238a0b923820dcc509a6f75849b | 159632
  2 | c81e728d9d4c2f636f067f89cc14862c | 182952
  3 | eccbc87e4b5ce2fe28308fd9f2a7baf3 | 438287
  4 | a87ff679a2f3e71d9181a67b7542122c |  78240
  5 | e4da3b7fbbce2345d7772b0674a318d5 |  20293
  6 | 1679091c5a880faf6fb5e6087eb1b2dc | 909742
  7 | 8f14e45fceea167a5a36dedd4bea2543 | 926496
  8 | c9f0f895fb98ab9159f51fd0297e236d | 463718
  9 | 45c48cce2e2d7fbdea1afc51c7c6ad26 |  65842
 10 | d3d9446802a44259755d38e6d163e820 |  81791

How can I then generate a table with one scan that resembles this..

SELECT id, md5::text AS x
FROM foo
UNION ALL
  SELECT id, trunc::text
  FROM foo;

 id |                x                 
----+----------------------------------
  1 | c4ca4238a0b923820dcc509a6f75849b
  1 | 961453
  2 | c81e728d9d4c2f636f067f89cc14862c
  2 | 842364
  3 | eccbc87e4b5ce2fe28308fd9f2a7baf3
  3 | 784693
  4 | a87ff679a2f3e71d9181a67b7542122c
  4 | 602039
  5 | e4da3b7fbbce2345d7772b0674a318d5
  5 | 176938
...

But that generates a query plan like this,

                                                        QUERY PLAN                                                         
---------------------------------------------------------------------------------------------------------------------------
 Append  (cost=0.00..33832.52 rows=1514052 width=64) (actual time=0.025..1034.740 rows=2000000 loops=1)
   ->  Seq Scan on foo  (cost=0.00..16916.26 rows=757026 width=64) (actual time=0.025..173.272 rows=1000000 loops=1)
   ->  Seq Scan on foo foo_1  (cost=0.00..16916.26 rows=757026 width=64) (actual time=0.016..715.279 rows=1000000 loops=1)
 Planning time: 0.128 ms
 Execution time: 1103.499 ms
(5 rows)

What would it look like to have one sec scan, and would it be faster if the table was only read once?

Question inspired from this conversation

المحلول

Answer inspired from this

A few methods, all test with PostgreSQL 9.5.

`CROSS JOIN LATERAL ... VALUES`

This is actually slower, but it seems good for a first attempt..

SELECT id, x
FROM foo
CROSS JOIN LATERAL (VALUES (md5),(trunc::text))
  AS t(x);

                                                     QUERY PLAN                                                      
---------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..50982.43 rows=1514052 width=64) (actual time=0.035..1934.061 rows=2000000 loops=1)
   ->  Seq Scan on foo  (cost=0.00..16916.26 rows=757026 width=72) (actual time=0.027..114.655 rows=1000000 loops=1)
   ->  Values Scan on "*VALUES*"  (cost=0.00..0.03 rows=2 width=32) (actual time=0.000..0.001 rows=2 loops=1000000)
 Planning time: 0.115 ms
 Execution time: 2027.840 ms
(5 rows)

`CROSS JOIN ... VALUES CASE`

SELECT id, CASE WHEN x THEN md5 ELSE trunc::text END AS x 
FROM foo
CROSS JOIN (VALUES (true),(false))
  AS t(x);

                                                     QUERY PLAN                                                      
---------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..43412.20 rows=1514052 width=73) (actual time=0.036..1318.494 rows=2000000 loops=1)
   ->  Seq Scan on foo  (cost=0.00..16916.26 rows=757026 width=72) (actual time=0.026..108.375 rows=1000000 loops=1)
   ->  Materialize  (cost=0.00..0.04 rows=2 width=1) (actual time=0.000..0.000 rows=2 loops=1000000)
         ->  Values Scan on "*VALUES*"  (cost=0.00..0.03 rows=2 width=1) (actual time=0.002..0.003 rows=2 loops=1)
 Planning time: 0.104 ms
 Execution time: 1381.685 ms
(6 rows)

Row duplication with `ARRAY/unnest`

SELECT id, x
FROM foo
CROSS JOIN LATERAL unnest(ARRAY[md5,trunc::text])
  AS t(x);

                                                      QUERY PLAN                                                      
----------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.01..1530968.27 rows=75702600 width=64) (actual time=0.036..3329.324 rows=2000000 loops=1)
   ->  Seq Scan on foo  (cost=0.00..16916.26 rows=757026 width=72) (actual time=0.015..156.087 rows=1000000 loops=1)
   ->  Function Scan on unnest t  (cost=0.01..1.01 rows=100 width=32) (actual time=0.002..0.003 rows=2 loops=1000000)
 Planning time: 0.054 ms
 Execution time: 3439.064 ms
(5 rows)

tldr;

Neither of these methods are faster. They're slower, and more complex. Stick with the UNION ALL.

مرخصة بموجب: CC-BY-SA مع الإسناد

لا تنتمي إلى dba.stackexchange

How do I eliminate a second seq scan over a table when deriving a new table?

CROSS JOIN LATERAL ... VALUES

CROSS JOIN ... VALUES CASE

Row duplication with ARRAY/unnest

tldr;

`CROSS JOIN LATERAL ... VALUES`

`CROSS JOIN ... VALUES CASE`

Row duplication with `ARRAY/unnest`