Question

In a postgres 9.4 database, I have a table with a field identifiers: varchar(512)[] that contains a list of identifiers for each row. When adding a new row, I want to make sure its identifiers list does not overlap with any existing row, so I want to search first for such a conflicting row using a GIN index on that column.


    Column      |           Type           |                        Modifiers                         
----------------+--------------------------+-----------------------------------------------------------
id              | integer                  | not null default nextval('my_table_id_seq'::regclass)    
identifiers     | character varying(512)[] | 
... other columns ...

Indexes:
    "my_table_pkey" PRIMARY KEY, btree (id)
    "gin_idx_identifiers" gin (identifiers) WITH (fastupdate='on')
    ... other indexes ...

The problem is that by default, the following query does not use the GIN index:

EXPLAIN ANALYZE
SELECT * FROM "my_table"
WHERE "my_table"."identifiers" && ARRAY['a_sample_identifier']::varchar(512)[] LIMIT 1;

                                                            QUERY PLAN                                                            
----------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..20.97 rows=1 width=1368) (actual time=31120.985..31120.986 rows=1 loops=1)
   ->  Seq Scan on my_table  (cost=0.00..4318413.15 rows=205920 width=1368) (actual time=31120.983..31120.983 rows=1 loops=1)
         Filter: (identifiers && '{a_sample_identifier}'::character varying(512)[])
         Rows Removed by Filter: 14572058
 Planning time: 0.135 ms
 Execution time: 31121.023 ms

If I set enable_seqscan = off, then the index is used (and the query is much faster, unsurprisingly):

                                                                QUERY PLAN                                                                
------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1755.88..1757.62 rows=1 width=1368) (actual time=0.065..0.065 rows=1 loops=1)
   ->  Bitmap Heap Scan on my_table  (cost=1755.88..359284.85 rows=205920 width=1368) (actual time=0.064..0.064 rows=1 loops=1)
         Recheck Cond: (identifiers && '{a_sample_identifier}'::character varying(512)[])
         Heap Blocks: exact=1
         ->  Bitmap Index Scan on gin_idx_identifiers  (cost=0.00..1704.40 rows=205920 width=0) (actual time=0.056..0.056 rows=1 loops=1)
               Index Cond: (identifiers && '{a_sample_identifier}'::character varying(512)[])
 Planning time: 0.126 ms
 Execution time: 0.102 ms
(8 rows)

However, I have read that setting enable_seqscan = off globally is dangerous. How should I force postgres to use the index?


Solution

You can use an EXCLUDE constraint to have the system automatically do the check for you. Once in place, it will automatically throw an error if you try to insert a row that violates the constraint. This is probably your best option. If you first check manually and then insert the row, there could be a race condition which allows two conflicting rows to be inserted successfully.
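As a hedged sketch: an EXCLUDE constraint needs an index method that supports exclusion (GiST, not GIN), and core Postgres 9.4 has no GiST operator class for `varchar[]`. The `intarray` extension does provide one for `integer[]`, so the pattern looks like this if the identifiers can be represented as integers; for `varchar(512)[]` you would need a custom GiST opclass. Table and constraint names here are illustrative, not from the question.

```sql
-- Sketch: exclusion constraint on array overlap, assuming integer
-- identifiers so the intarray GiST opclass (gist__int_ops) applies.
CREATE EXTENSION IF NOT EXISTS intarray;

CREATE TABLE example_table (
    id          serial PRIMARY KEY,
    identifiers integer[] NOT NULL,
    -- Reject any row whose identifiers overlap (&&) an existing row's.
    EXCLUDE USING gist (identifiers gist__int_ops WITH &&)
);

-- INSERT INTO example_table (identifiers) VALUES ('{1,2}');  -- ok
-- INSERT INTO example_table (identifiers) VALUES ('{2,3}');  -- ERROR: conflict
```

Because the check happens inside the constraint at insert time, two concurrent transactions cannot both slip a conflicting row past it.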

If for some reason you really need to do the check manually, you could rewrite the query to use a COUNT(*) rather than LIMIT 1, and then check whether the count is zero.

The problem is that the planner thinks the seq scan will find one matching row very quickly, and can then stop. Writing the query with a COUNT(*) forces the planner to account for the true cost of the seq scan, since it can no longer assume an early exit. Alternatively, you could just drop the LIMIT 1 from the query, and on the client side fetch only one row and then stop fetching.
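For example, the original query could be rewritten as follows (same table and sample identifier as in the question); with no LIMIT to exit early on, the planner should cost the full scan and choose the GIN bitmap index scan instead:

```sql
-- Aggregating removes the "stop after one row" assumption that made
-- the seq scan look cheap, so the GIN index becomes the cheaper plan.
SELECT COUNT(*)
FROM "my_table"
WHERE "my_table"."identifiers" && ARRAY['a_sample_identifier']::varchar(512)[];
-- Insert the new row only if this returns 0.
```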

Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange