Why is my TSV index not being used?

https://dba.stackexchange.com/questions/116549

29-09-2020
Question

I'm trying to get PostgreSQL's full-text search functionality working.

I have two tables: one that I created just for testing, and the actual one that I want to be able to search:

Test table:

webarchive=# \d test_sites
                            Table "public.test_sites"
   Column    |   Type   |                        Modifiers
-------------+----------+---------------------------------------------------------
 id          | integer  | not null default nextval('test_sites_id_seq'::regclass)
 content     | text     |
 tsv_content | tsvector |
Indexes:
    "test_sites_pkey" PRIMARY KEY, btree (id)
    "idx_test_web_pages_content" gin (tsv_content)
Triggers:
    web_pages_testing_content_change_trigger AFTER INSERT OR UPDATE ON test_sites FOR EACH ROW EXECUTE PROCEDURE web_pages_testing_content_update_func()

"Real" table:

webarchive=# \d web_pages
                                      Table "public.web_pages"
    Column    |            Type             |                       Modifiers
--------------+-----------------------------+--------------------------------------------------------
 id           | integer                     | not null default nextval('web_pages_id_seq'::regclass)
 state        | dlstate_enum                | not null
 errno        | integer                     |
 url          | text                        | not null
 starturl     | text                        | not null
 netloc       | text                        | not null
 file         | integer                     |
 priority     | integer                     | not null
 distance     | integer                     | not null
 is_text      | boolean                     |
 limit_netloc | boolean                     |
 title        | citext                      |
 mimetype     | text                        |
 type         | itemtype_enum               |
 raw_content  | text                        |
 content      | text                        |
 fetchtime    | timestamp without time zone |
 addtime      | timestamp without time zone |
 tsv_content  | tsvector                    |
Indexes:
    "web_pages_pkey" PRIMARY KEY, btree (id)
    "ix_web_pages_url" UNIQUE, btree (url)
    "idx_web_pages_content" gin (tsv_content)
    "idx_web_pages_title" gin (to_tsvector('english'::regconfig, title::text))
    "ix_web_pages_distance" btree (distance)
    "ix_web_pages_distance_filtered" btree (priority) WHERE state = 'new'::dlstate_enum AND distance < 1000000
    "ix_web_pages_priority" btree (priority)
    "ix_web_pages_type" btree (type)
    "ix_web_pages_url_ops" btree (url text_pattern_ops)
Foreign-key constraints:
    "web_pages_file_fkey" FOREIGN KEY (file) REFERENCES web_files(id)
Triggers:
    web_pages_content_change_trigger AFTER INSERT OR UPDATE ON web_pages FOR EACH ROW EXECUTE PROCEDURE web_pages_content_update_func()

Extra bits aside, both tables have a content column and a tsv_content column with a gin() index on it. A trigger updates the tsv_content column every time the content column is modified.
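The body of the trigger function is not shown in the question. As a rough sketch of one common way to keep such a column in sync, assuming the 'english' configuration (as used by the title index) and a BEFORE trigger so the incoming row can be modified in place, something like the following could work. The function and trigger names are hypothetical, and this differs from the AFTER trigger listed in the table definitions:

```sql
-- Hypothetical sketch; the real web_pages_content_update_func() is not shown.
CREATE OR REPLACE FUNCTION tsv_content_sync_example() RETURNS trigger AS $$
BEGIN
    -- Recompute the stored tsvector from the text column.
    -- to_tsvector returns NULL for NULL input, so rows without
    -- content keep a NULL tsv_content.
    NEW.tsv_content := to_tsvector('english', NEW.content);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- Illustrative only: attached here to the question's test table.
CREATE TRIGGER tsv_content_sync_example_trigger
    BEFORE INSERT OR UPDATE ON test_sites
    FOR EACH ROW EXECUTE PROCEDURE tsv_content_sync_example();
```

A BEFORE trigger avoids issuing a second UPDATE against the same row, which an AFTER trigger would need in order to change the stored value.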

Note that the other gin index works fine. I actually initially had a gin (to_tsvector('english'::regconfig, content::text)) index on the content column as well, instead of the second column, but after waiting for that index to rebuild a few times during testing, I decided to use a separate column to pre-store the tsvector values.
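The two approaches mentioned here can be sketched side by side. This is an illustration against the test table, not the exact statements used in the question; the index name idx_content_expr_example is hypothetical:

```sql
-- Option 1: expression index. The tsvector is computed while the index is
-- built, and the two-argument to_tsvector form is required because index
-- expressions must be immutable.
CREATE INDEX idx_content_expr_example ON test_sites
    USING gin (to_tsvector('english', content));

-- The query has to repeat the same expression for the index to match.
SELECT id
FROM test_sites
WHERE to_tsvector('english', content) @@ to_tsquery('english', 'testing');

-- Option 2: stored tsvector column, indexed directly
-- (idx_test_web_pages_content in the table definition above).
SELECT id
FROM test_sites
WHERE tsv_content @@ to_tsquery('english', 'testing');
```

With the stored column, rebuilding the index no longer has to re-parse the text, at the cost of extra storage and a trigger to maintain.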

Running a query against the test table uses the index as I'd expect:

webarchive=# EXPLAIN ANALYZE SELECT
    test_sites.id,
    test_sites.content,
    ts_rank_cd(test_sites.tsv_content, to_tsquery($$testing$$)) AS ts_rank_cd_1
FROM test_sites
WHERE test_sites.tsv_content @@ to_tsquery($$testing$$);
                                                              QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on test_sites  (cost=16.45..114.96 rows=25 width=669) (actual time=0.175..3.720 rows=143 loops=1)
   Recheck Cond: (tsv_content @@ to_tsquery('testing'::text))
   Heap Blocks: exact=117
   ->  Bitmap Index Scan on idx_test_web_pages_content  (cost=0.00..16.44 rows=25 width=0) (actual time=0.109..0.109 rows=143 loops=1)
         Index Cond: (tsv_content @@ to_tsquery('testing'::text))
 Planning time: 0.414 ms
 Execution time: 3.800 ms
(7 rows)

However, the identical query on the real table never seems to result in anything but a plain old sequential scan:

webarchive=# EXPLAIN ANALYZE SELECT
    web_pages.id,
    web_pages.content,
    ts_rank_cd(web_pages.tsv_content, to_tsquery($$testing$$)) AS ts_rank_cd_1
FROM web_pages
WHERE web_pages.tsv_content @@ to_tsquery($$testing$$);
                                                       QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 Seq Scan on web_pages  (cost=0.00..4406819.80 rows=19751 width=505) (actual time=0.343..142325.954 rows=134949 loops=1)
   Filter: (tsv_content @@ to_tsquery('testing'::text))
   Rows Removed by Filter: 12764373
 Planning time: 0.436 ms
 Execution time: 142341.489 ms
(5 rows)

I have increased my work memory to 3 GB to see if that was the problem, and it is not.

It should also be noted that these are fairly large tables: ~150 GB of text across 4M rows (with an additional 8M rows where content / tsv_content is NULL).

The test_sites table has 1/1000th of the rows of web_pages, since it is somewhat prohibitive to experiment when every query takes multiple minutes.

I'm using PostgreSQL 9.5 (yes, I compiled it myself; I wanted ON CONFLICT). There doesn't seem to be a tag for that yet.

I've read through the open issues for 9.5, and I can't see this being a result of any of them.

Fresh from a complete rebuild of the index, the problem still exists:

webarchive=# ANALYZE web_pages ;
ANALYZE
webarchive=# EXPLAIN ANALYZE SELECT
    web_pages.id,
    web_pages.content,
    ts_rank_cd(web_pages.tsv_content, to_tsquery($$testing$$)) AS ts_rank_cd_1
FROM web_pages
WHERE web_pages.tsv_content @@ to_tsquery($$testing$$);
                                                              QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on web_pages  (cost=10000000000.00..10005252343.30 rows=25109 width=561) (actual time=7.114..146444.168 rows=134949 loops=1)
   Filter: (tsv_content @@ to_tsquery('testing'::text))
   Rows Removed by Filter: 13137318
 Planning time: 0.521 ms
 Execution time: 146465.188 ms
(5 rows)
Note that I had literally just ANALYZEd, and seqscan is disabled.
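For context, "seqscan is disabled" refers to the enable_seqscan planner setting. Turning it off does not forbid sequential scans; the planner adds a very large penalty to their estimated cost, which is consistent with the 10000000000.00 startup cost in the plan above. A sequential scan chosen despite that penalty suggests the planner found no other usable path at all. A minimal sketch of the session setup follows, together with a catalog check whose relevance to this particular case is an assumption, not something the question establishes:

```sql
-- Discourage sequential scans for this session only.
SET enable_seqscan = off;
SHOW enable_seqscan;

-- Assumption: it may be worth confirming that the planner is allowed to
-- use the index at all. indisvalid / indisready should be true for a
-- usable index; indcheckxmin marks an index that older snapshots must not use.
SELECT indexrelid::regclass AS index_name,
       indisvalid,
       indisready,
       indcheckxmin
FROM pg_index
WHERE indrelid = 'web_pages'::regclass;

-- Restore the default afterwards.
RESET enable_seqscan;
```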

Solution

Well, I spent some time freeing up more space on the disk that holds the DB by moving some other databases to another SSD.

I then ran VACUUM ANALYZE across the entire database, and now it has apparently noticed that I have the index.

I had previously both analyzed and vacuumed just this table, but apparently it somehow made a difference to do it database-wide rather than on a specific table. Go figure.
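The difference between the two invocations can be sketched as below. Why the database-wide run changed the plan is not explained in this answer, so the statistics query is only a way to confirm when each maintenance step last ran, not a diagnosis:

```sql
-- Per-table form (what had been tried before).
VACUUM ANALYZE web_pages;

-- Database-wide form: with no table name, every table in the current
-- database that the current user is permitted to vacuum is processed.
VACUUM ANALYZE;

-- Check when the table was last vacuumed and analyzed,
-- manually or by autovacuum.
SELECT relname,
       last_vacuum,
       last_autovacuum,
       last_analyze,
       last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'web_pages';
```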

webarchive=# EXPLAIN ANALYZE SELECT
    web_pages.id,
    web_pages.content
FROM web_pages
WHERE web_pages.tsv_content @@ to_tsquery($$testing$$);
                                                                 QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on web_pages  (cost=1185.79..93687.30 rows=23941 width=189) (actual time=41.448..152.108 rows=134949 loops=1)
   Recheck Cond: (tsv_content @@ to_tsquery('testing'::text))
   Heap Blocks: exact=105166
   ->  Bitmap Index Scan on idx_web_pages_content  (cost=0.00..1179.81 rows=23941 width=0) (actual time=24.940..24.940 rows=134996 loops=1)
         Index Cond: (tsv_content @@ to_tsquery('testing'::text))
 Planning time: 0.452 ms
 Execution time: 154.942 ms
(7 rows)

I also took the opportunity to run a VACUUM FULL; now that I have enough space for the processing. I've had quite a bit of row churn in the table as I experimented during development, and I'd like to try to consolidate any file fragmentation that has resulted from that.
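As a rough way to see how much a rewrite reclaims, the table's size and dead-tuple counters can be compared before and after. This is a sketch; the answer ran VACUUM FULL without a table name, while the per-table form is shown here as an assumption about what would suffice for web_pages alone:

```sql
-- Size of the table including indexes and TOAST, and of the heap alone.
SELECT pg_size_pretty(pg_total_relation_size('web_pages')) AS total_size,
       pg_size_pretty(pg_relation_size('web_pages'))       AS heap_size;

-- Estimated live and dead row counts from the statistics collector.
SELECT n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'web_pages';

-- Rewrites the table into a new file. This takes an exclusive lock on the
-- table and needs enough free disk space for the new copy while it runs.
VACUUM FULL web_pages;
```

Running the two SELECT statements again afterwards shows how much space the rewrite returned.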

Licensed under: CC-BY-SA with attribution

Not affiliated with dba.stackexchange