Searching INET[] for CIDR match, efficiently using indexes (without UNNEST)

https://dba.stackexchange.com/questions/278309

09-03-2021

Question

This question stems from recently discovering that two ways of searching the table below differ greatly in performance:

DROP TABLE IF EXISTS dns_lookup;
CREATE TABLE IF NOT EXISTS dns_lookup (
        fqdn TEXT,
        ip_address INET[],
        description TEXT,
        PRIMARY KEY (ip_address));
DROP INDEX IF EXISTS dns_lookup_ip_address_gin_idx;
CREATE INDEX dns_lookup_ip_address_gin_idx ON dns_lookup USING gin (ip_address);
INSERT INTO dns_lookup(fqdn, ip_address, description) VALUES('test1.com', '{1.2.3.4, 1.2.3.5, 6.7.8.9}', 'Some Public Networks');
INSERT INTO dns_lookup(fqdn, ip_address, description) VALUES('test2.com', '{192.168.1.1, 192.168.1.2, 192.168.1.3}', 'Some Private Networks');

In this example, fqdn TEXT is the DNS name of a host and ip_address INET[] holds all of the DNS A records for that fqdn.

Assume, of course, that the real table has many millions of rows and is expensive to search.

The first query, using ANY(), does not use any index, so while the syntax is straightforward, it is also very slow:

psql> EXPLAIN ANALYZE SELECT * FROM dns_lookup WHERE '1.2.3.4' = ANY(ip_address);
QUERY PLAN
Seq Scan on dns_lookup  (cost=0.00..24.62 rows=3 width=96) (actual time=0.007..0.008 rows=1 loops=1)
  Filter: ('1.2.3.4'::inet = ANY (ip_address))
  Rows Removed by Filter: 1
Planning time: 0.134 ms
Execution time: 0.025 ms
(5 rows)
Time: 89.155 ms

The better way is to cast the parameter to an INET[] so that the @> operator can be used for the search. This is significantly faster because it uses the GIN index:

psql> EXPLAIN ANALYZE SELECT * FROM dns_lookup WHERE ARRAY['1.2.3.4']::INET[] @> ip_address;
QUERY PLAN
Bitmap Heap Scan on dns_lookup  (cost=8.03..12.42 rows=3 width=96) (actual time=0.010..0.010 rows=0 loops=1)
  Recheck Cond: ('{1.2.3.4}'::inet[] @> ip_address)
  Rows Removed by Index Recheck: 1
  Heap Blocks: exact=1
  ->  Bitmap Index Scan on dns_lookup_ip_address_gin_idx  (cost=0.00..8.02 rows=3 width=0) (actual time=0.006..0.006 rows=1 loops=1)
        Index Cond: ('{1.2.3.4}'::inet[] @> ip_address)
Planning time: 0.039 ms
Execution time: 0.025 ms
(8 rows)
Time: 89.696 ms
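
A caveat about operand order, offered as a sketch rather than a verified fix: a @> b is true when array a contains every element of array b. Written as ARRAY['1.2.3.4']::INET[] @> ip_address, the condition asks whether the one-element array contains every address in the row, which would explain why the two-row example above reports rows=0 after the index recheck. To find rows whose array includes a given address, the column would normally go on the left. The default GIN operator class for arrays supports that form as well as the overlap operator &&.

```sql
-- Rows whose ip_address array includes 1.2.3.4 (column on the left of @>).
SELECT fqdn, ip_address
FROM dns_lookup
WHERE ip_address @> ARRAY['1.2.3.4']::inet[];

-- The overlap operator && matches rows containing any of several addresses.
SELECT fqdn, ip_address
FROM dns_lookup
WHERE ip_address && ARRAY['1.2.3.4', '6.7.8.9']::inet[];
```

Both forms still test exact element equality, so neither answers the CIDR question that follows.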

My question now is: is there something similar I can do to improve a query that looks like this?

SELECT * FROM dns_lookup WHERE '192.168.0.0/16' >> ANY(ip_address);

This is more complex than a simple array-contains operation, because it applies the network-contains operator (>>) to each element of the array. Rather than "show me all FQDNs that resolved to '192.168.0.1'", it is "show me all FQDNs that resolved to an IP in the '192.168.0.0/16' network". That is a very different question, and it relies on the inet_ops operators in PostgreSQL. Those operators can use an index on a plain INET column, but because ip_address is an array, such an index does not apply here.
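
For readers unfamiliar with the network operators, here is a minimal illustration of >> and >>= on scalar inet values, independent of any table. The addresses are only examples.

```sql
-- >> is "strictly contains"; >>= is "contains or equals".
SELECT inet '192.168.0.0/16' >>  inet '192.168.1.1';    -- true
SELECT inet '192.168.0.0/16' >>  inet '10.0.0.1';       -- false
SELECT inet '192.168.0.0/16' >>= inet '192.168.0.0/16'; -- true
SELECT inet '192.168.0.0/16' >>  inet '192.168.0.0/16'; -- false (strict)
```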

Right now I'm thinking that, to get better performance, it will be necessary to create an index that is effectively on the UNNEST() of the ip_address column, and to perform an UNNEST() in the query. This is an acceptable solution, but not as simple as I would like. If my logic is flawed and that is not a good solution, please let me know why and what a better approach would be. I have very limited experience with SQL and am very happy to be corrected.

In summary, how can I do this more efficiently while avoiding having to do something like create a separate table with the ip_address unnested? Or is that really my best option?

SELECT * FROM dns_lookup WHERE '192.168.0.0/16' >> ANY(ip_address);

I am aware of the ip4r extension, though I have not yet worked with it much. It's not immediately obvious to me how it might help.

EDIT: For the benefit of anyone who may misunderstand (@a_horse_with_no_name, take a look and you'll see what I'm talking about), here is how the planner behaves on a real table, not one contrived with just two rows. The two-row table was only an example to show the structure:

[local:/tmp]:5432 dns-dba@dnsdb-dev=# EXPLAIN ANALYZE SELECT * FROM dns_lookup WHERE ARRAY['1.2.3.4']::INET[] @> ip_address;
QUERY PLAN
Bitmap Heap Scan on dns_lookup  (cost=10.17..53.70 rows=22 width=224) (actual time=0.072..0.109 rows=12 loops=1)
  Recheck Cond: ('{1.2.3.4}'::inet[] @> ip_address)
  Heap Blocks: exact=12
  ->  Bitmap Index Scan on dns_lookup_ip_address_gin_inetops_idx  (cost=0.00..10.16 rows=22 width=0) (actual time=0.056..0.056 rows=12 loops=1)
        Index Cond: ('{1.2.3.4}'::inet[] @> ip_address)
Planning time: 0.914 ms
Execution time: 0.139 ms
(7 rows)
Time: 219.104 ms
[local:/tmp]:5432 dns-dba@dnsdb-dev=# EXPLAIN ANALYZE SELECT * FROM dns_lookup WHERE '1.2.3.4' = ANY(ip_address);
QUERY PLAN
Gather  (cost=1000.00..24477.96 rows=39 width=224) (actual time=30.506..69.379 rows=12 loops=1)
  Workers Planned: 3
  Workers Launched: 3
  ->  Parallel Seq Scan on dns_lookup  (cost=0.00..23474.06 rows=13 width=224) (actual time=30.685..65.488 rows=3 loops=4)
        Filter: ('1.2.3.4'::inet = ANY (ip_address))
        Rows Removed by Filter: 146905
Planning time: 0.171 ms
Execution time: 69.411 ms
(8 rows)
Time: 289.164 ms
[local:/tmp]:5432 dns-dba@dnsdb-dev=#

As I said, using ANY and UNNEST does not use any indexes, and performance is significantly better when using the @> operator, as would be expected when an index can be used on a large table.

EDIT: I ended up finding no "trick" for this and implemented an additional table holding the output of UNNEST() applied to the ip_address column, as suggested by @max-vernon. It has more rows and a different uniqueness constraint, but that is not a huge problem. I was just hoping there was some method I wasn't aware of to do what I wanted (efficiently) all on this single table. Thanks for the input!
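
A minimal sketch of what such an additional table might look like, not the poster's actual schema. The table name dns_lookup_ip, the index name, and the (fqdn, ip_address) key are assumptions made for illustration. PostgreSQL does not allow a set-returning function such as unnest() in an index expression, so the unnested addresses have to live in rows of their own. A GiST index with the inet_ops operator class can then serve the containment operators.

```sql
-- Hypothetical side table: one row per (fqdn, address) pair.
CREATE TABLE dns_lookup_ip (
    fqdn       text NOT NULL,
    ip_address inet NOT NULL,
    PRIMARY KEY (fqdn, ip_address)   -- assumed uniqueness rule
);

-- Populate it from the array column, skipping NULL names and elements.
INSERT INTO dns_lookup_ip (fqdn, ip_address)
SELECT DISTINCT d.fqdn, u.ip
FROM dns_lookup AS d
CROSS JOIN LATERAL unnest(d.ip_address) AS u(ip)
WHERE d.fqdn IS NOT NULL
  AND u.ip IS NOT NULL;

-- GiST index using inet_ops, which supports <<, <<=, >>, >>= and &&.
CREATE INDEX dns_lookup_ip_gist_idx
    ON dns_lookup_ip USING gist (ip_address inet_ops);

-- All FQDNs that resolved to an address inside 192.168.0.0/16.
SELECT DISTINCT fqdn
FROM dns_lookup_ip
WHERE ip_address << inet '192.168.0.0/16';
```

Running EXPLAIN ANALYZE on the last query against real data is the way to confirm whether the planner actually chooses the GiST index. The side table also has to be kept in sync with dns_lookup, for example by the application or a trigger.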

Solution

I don't personally have much experience with PostgreSQL specifically. If I were to do this in a DBMS that didn't support inet addresses explicitly, I'd certainly lean towards a related table to store the addresses. Moving that data out into a separate table would tend to improve the performance of queries on the main table that don't need inet address info, and having a discrete row per address in the secondary table makes for easy analysis.

In SQL Server, for IPv4 addresses and networks I would tend to use a binary(4) column, and for IPv6 a binary(16) column, if I were really concerned about keeping the data as small as possible.

It's entirely possible that PostgreSQL has an extension or some built-in method that would perform better.

Licensed under: CC-BY-SA with attribution

Not affiliated with dba.stackexchange