How do I eliminate a second seq scan over a table when deriving a new table?
06-10-2020
Question
Let's say I have some sample data, 1 million rows.
CREATE TEMP TABLE foo
AS
SELECT id, md5(id::text), trunc(random()*1e6)
FROM generate_series(1,1e6) AS t(id);
This generates a table like this:
id | md5 | trunc
----+----------------------------------+--------
1 | c4ca4238a0b923820dcc509a6f75849b | 159632
2 | c81e728d9d4c2f636f067f89cc14862c | 182952
3 | eccbc87e4b5ce2fe28308fd9f2a7baf3 | 438287
4 | a87ff679a2f3e71d9181a67b7542122c | 78240
5 | e4da3b7fbbce2345d7772b0674a318d5 | 20293
6 | 1679091c5a880faf6fb5e6087eb1b2dc | 909742
7 | 8f14e45fceea167a5a36dedd4bea2543 | 926496
8 | c9f0f895fb98ab9159f51fd0297e236d | 463718
9 | 45c48cce2e2d7fbdea1afc51c7c6ad26 | 65842
10 | d3d9446802a44259755d38e6d163e820 | 81791
How can I then generate, with one scan, a table that resembles this:
SELECT id, md5::text AS x
FROM foo
UNION ALL
SELECT id, trunc::text
FROM foo;
id | x
----+----------------------------------
1 | c4ca4238a0b923820dcc509a6f75849b
1 | 961453
2 | c81e728d9d4c2f636f067f89cc14862c
2 | 842364
3 | eccbc87e4b5ce2fe28308fd9f2a7baf3
3 | 784693
4 | a87ff679a2f3e71d9181a67b7542122c
4 | 602039
5 | e4da3b7fbbce2345d7772b0674a318d5
5 | 176938
...
But that generates a query plan like this:
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
Append (cost=0.00..33832.52 rows=1514052 width=64) (actual time=0.025..1034.740 rows=2000000 loops=1)
  -> Seq Scan on foo (cost=0.00..16916.26 rows=757026 width=64) (actual time=0.025..173.272 rows=1000000 loops=1)
  -> Seq Scan on foo foo_1 (cost=0.00..16916.26 rows=757026 width=64) (actual time=0.016..715.279 rows=1000000 loops=1)
Planning time: 0.128 ms
Execution time: 1103.499 ms
(5 rows)
What would it look like to have one seq scan, and would it be faster if the table were only read once?
Solution
A few methods, all tested with PostgreSQL 9.5.
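The plans in this answer report actual times, which is the output EXPLAIN ANALYZE produces. A minimal sketch of reproducing the baseline measurement, assuming the temp table foo from the question exists in the current session (timings will differ by machine and run):

```sql
-- Baseline: the UNION ALL query from the question, run under EXPLAIN ANALYZE.
-- EXPLAIN ANALYZE executes the query and prints estimated and actual figures.
EXPLAIN ANALYZE
SELECT id, md5::text AS x
FROM foo
UNION ALL
SELECT id, trunc::text
FROM foo;
```

Prefixing each query below with EXPLAIN ANALYZE in the same way should give plans of the same shape as the ones shown.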
CROSS JOIN LATERAL ... VALUES
This is actually slower, but it seems like a good first attempt:
SELECT id, x
FROM foo
CROSS JOIN LATERAL (VALUES (md5),(trunc::text))
AS t(x);
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=0.00..50982.43 rows=1514052 width=64) (actual time=0.035..1934.061 rows=2000000 loops=1)
  -> Seq Scan on foo (cost=0.00..16916.26 rows=757026 width=72) (actual time=0.027..114.655 rows=1000000 loops=1)
  -> Values Scan on "*VALUES*" (cost=0.00..0.03 rows=2 width=32) (actual time=0.000..0.001 rows=2 loops=1000000)
Planning time: 0.115 ms
Execution time: 2027.840 ms
(5 rows)
CROSS JOIN ... VALUES CASE
SELECT id, CASE WHEN x THEN md5 ELSE trunc::text END AS x
FROM foo
CROSS JOIN (VALUES (true),(false))
AS t(x);
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=0.00..43412.20 rows=1514052 width=73) (actual time=0.036..1318.494 rows=2000000 loops=1)
  -> Seq Scan on foo (cost=0.00..16916.26 rows=757026 width=72) (actual time=0.026..108.375 rows=1000000 loops=1)
  -> Materialize (cost=0.00..0.04 rows=2 width=1) (actual time=0.000..0.000 rows=2 loops=1000000)
        -> Values Scan on "*VALUES*" (cost=0.00..0.03 rows=2 width=1) (actual time=0.002..0.003 rows=2 loops=1)
Planning time: 0.104 ms
Execution time: 1381.685 ms
(6 rows)
Row duplication with ARRAY/unnest
SELECT id, x
FROM foo
CROSS JOIN LATERAL unnest(ARRAY[md5,trunc::text])
AS t(x);
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=0.01..1530968.27 rows=75702600 width=64) (actual time=0.036..3329.324 rows=2000000 loops=1)
  -> Seq Scan on foo (cost=0.00..16916.26 rows=757026 width=72) (actual time=0.015..156.087 rows=1000000 loops=1)
  -> Function Scan on unnest t (cost=0.01..1.01 rows=100 width=32) (actual time=0.002..0.003 rows=2 loops=1000000)
Planning time: 0.054 ms
Execution time: 3439.064 ms
(5 rows)
tl;dr
None of these methods is faster. They're slower and more complex: in the runs above, the UNION ALL took about 1.1 s, against roughly 1.4 s to 3.4 s for the alternatives. Stick with the UNION ALL.
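Since the goal is a derived table, the UNION ALL query can feed CREATE TABLE AS directly. A minimal sketch, assuming the table foo from the question; the name bar is hypothetical:

```sql
-- bar is a hypothetical name for the derived table.
-- The columns md5 and trunc get their names from the unaliased
-- expressions used when foo was created.
CREATE TEMP TABLE bar
AS
  SELECT id, md5 AS x
  FROM foo
  UNION ALL
  SELECT id, trunc::text
  FROM foo;
```

This still reads foo twice, as in the plan from the question, which the timings above suggest is the cheaper option.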
Not affiliated with dba.stackexchange