PostgreSQL internals composite column cardinality

Question 1

The key differences between a composite index (an index on (a, b)) and an index on a hash function are:

with a composite index PostgreSQL can make decisions based on the statistics it keeps for each individual column; and
in a composite index you can query the index efficiently for just a. You can not query it for just b, though.

On the other hand, with an index on a::bigint << 32 + b, ie a 64-bit single-element index that combines the values of a and b, you can only use it when you have both a and b. The same is true of an index on some_hash_function(a,b).

There can be a big advantage to an index on the hash of the value, in that it makes the index a lot smaller at the cost of reduced selectivity and the need to re-check the condition with something like:

WHERE some_hash_function(a,b) = some_hash_function(42,3) AND (a = 42 AND b = 3)

There is a significant possibility you neglected to consider, though: two separate indexes on a and b. PostgreSQL can combine these in a bitmap index scan or use them separately, whatever's more appropriate for the query. This is usually the best option for two loosely related and mostly uncorrelated values.

Given the example:

CREATE TABLE demoab(a integer, b integer);

INSERT INTO demoab(a, b)
SELECT a, b from generate_series(1,1000) a 
CROSS JOIN generate_series(1,1000) b;

CREATE INDEX demoab_a ON demoab(a);
CREATE INDEX demoab_b ON demoab(b);
CREATE INDEX demoab_ab ON demoab(a,b);
CREATE INDEX demoab_ab_shifted ON demoab ((a::bigint << 32 + b));
ANALYZE demoab;

CREATE TABLE demob AS SELECT DISTINCT b FROM demoab ;
CREATE TABLE demoa AS SELECT DISTINCT a FROM demoab ;
ALTER TABLE demoa ADD PRIMARY KEY (a);
ALTER TABLE demob ADD PRIMARY KEY (b);

Different query approaches:

regress=> explain analyze SELECT * FROM demoab WHERE a = 42 AND b = 3;
                                                    QUERY PLAN                                                    
------------------------------------------------------------------------------------------------------------------
 Index Scan using demoab_ab on demoab  (cost=0.00..8.38 rows=1 width=8) (actual time=0.034..0.036 rows=1 loops=1)
   Index Cond: ((a = 42) AND (b = 3))
 Total runtime: 0.088 ms
(3 rows)


regress=> explain analyze SELECT * FROM demoab WHERE b = 3;
                                                      QUERY PLAN                                                       
-----------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on demoab  (cost=19.85..2358.66 rows=967 width=8) (actual time=1.089..4.636 rows=1000 loops=1)
   Recheck Cond: (b = 3)
   ->  Bitmap Index Scan on demoab_b  (cost=0.00..19.61 rows=967 width=0) (actual time=0.661..0.661 rows=1000 loops=1)
         Index Cond: (b = 3)
 Total runtime: 4.820 ms
(5 rows)

regress=> explain analyze SELECT * FROM demoab WHERE a = 42;
                                                      QUERY PLAN                                                       
-----------------------------------------------------------------------------------------------------------------------
 Index Scan using demoab_a on demoab  (cost=0.00..37.19 rows=962 width=8) (actual time=0.155..0.751 rows=1000 loops=1)
   Index Cond: (a = 42)
 Total runtime: 0.929 ms
(3 rows)

regress=> explain analyze SELECT * FROM demoab WHERE (a::bigint << 32 + b) = (42::bigint << 32 + 3);
                                                         QUERY PLAN                                                         
----------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on demoab  (cost=4.69..157.67 rows=41 width=8) (actual time=0.260..0.495 rows=94 loops=1)
   Recheck Cond: (((a)::bigint << (32 + b)) = 1443109011456::bigint)
   ->  Bitmap Index Scan on demoab_ab_shifted  (cost=0.00..4.67 rows=41 width=0) (actual time=0.232..0.232 rows=94 loops=1)
         Index Cond: (((a)::bigint << (32 + b)) = 1443109011456::bigint)
 Total runtime: 0.584 ms
(5 rows)

Here, the composite index on (a,b) would be a clear win because it's able to use an index only scan to get the tuple directly, but it wouldn't be likely to in reality where you'll be fetching values from non-indexed columns. Because of that I ran SET enable_indexscan = off for these tests.

Rather surprisingly, the index sizes are the same:

regress=> SELECT 
pg_relation_size('demoab_ab') AS shifted, 
pg_relation_size('demoab_ab') AS ab, 
pg_relation_size('demoab_a') AS a, 
pg_relation_size('demoab_b') AS b;
 shifted  |    ab    |    a     |    b     
----------+----------+----------+----------
 22487040 | 22487040 | 22487040 | 22487040
(1 row)

I expected the single-value indexes to need considerably less space. Alignment requirements explain some of it, but it's still rather an unexpected result to me.

In the case of the join @wildplasser asked about:

regress=> EXPLAIN ANALYZE 
SELECT demoa.a, demob.b 
FROM demoab 
INNER JOIN demoa ON (demoa.a = demoab.a) 
INNER JOIN demob ON (demob.b = demoab.b) 
WHERE demoa.a = 100 AND demob.b = 500;

                                                          QUERY PLAN                                                          
------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..24.94 rows=1 width=8) (actual time=0.121..0.126 rows=1 loops=1)
   ->  Nested Loop  (cost=0.00..16.66 rows=1 width=8) (actual time=0.089..0.092 rows=1 loops=1)
         ->  Index Scan using demoab_ab on demoab  (cost=0.00..8.38 rows=1 width=8) (actual time=0.021..0.021 rows=1 loops=1)
               Index Cond: ((a = 100) AND (b = 500))
         ->  Index Scan using demoa_pkey on demoa  (cost=0.00..8.27 rows=1 width=4) (actual time=0.062..0.062 rows=1 loops=1)
               Index Cond: (a = 100)
   ->  Index Scan using demob_pkey on demob  (cost=0.00..8.27 rows=1 width=4) (actual time=0.029..0.031 rows=1 loops=1)
         Index Cond: (b = 500)
 Total runtime: 0.203 ms
(9 rows)

shows that in this case PostgreSQL is preferring the composite index on (a, b). This will not be the case if you're only joining on b though:

regress=> EXPLAIN ANALYZE 
SELECT demoab.a, demoab.b 
FROM demoab 
INNER JOIN demob ON (demob.b = demoab.b) 
WHERE demob.b = 500;
                                                         QUERY PLAN                                                          
-----------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=19.85..2376.59 rows=967 width=8) (actual time=0.935..3.653 rows=1000 loops=1)
   ->  Index Scan using demob_pkey on demob  (cost=0.00..8.27 rows=1 width=4) (actual time=0.029..0.032 rows=1 loops=1)
         Index Cond: (b = 500)
   ->  Bitmap Heap Scan on demoab  (cost=19.85..2358.66 rows=967 width=8) (actual time=0.897..3.123 rows=1000 loops=1)
         Recheck Cond: (b = 500)
         ->  Bitmap Index Scan on demoab_b  (cost=0.00..19.61 rows=967 width=0) (actual time=0.436..0.436 rows=1000 loops=1)
               Index Cond: (b = 500)
 Total runtime: 3.834 ms
(8 rows)

You'll note that any functional hash index won't be useful here. So for that reason, I recommend going for a composite index on (a,b) plus a secondary index on just (b) if you need it.

In terms of uniqueness, you'll find it informative to look at pg_catalog.pg_stat. There, you'll see that PostgreSQL doesn't maintain statistics on individual indexes, only on the heap columns that are indexed. In this case:

regress=> select tablename, attname, n_distinct, correlation 
from pg_stats where tablename like 'demo%';
     tablename     | attname | n_distinct | correlation 
-------------------+---------+------------+-------------
 demoab            | a       |       1000 |           1
 demoab            | b       |       1000 |   0.0105023
 demoab_ab_shifted | expr    |      21593 |   0.0175595
 demob             | b       |         -1 |    0.021045
 demoa             | a       |         -1 |    0.021045
(5 rows)

it doesn't look like Pg will see any significant difference between the hashed/combined approach and two discrete, independent values.

Question 2

If Col1 and Col2 are independent fields, why combining? It won't save any space. Just stick to the atomicity principle of database.