Why isn't my tsv index being used?

https://dba.stackexchange.com/questions/116549

29-09-2020

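Question

I'm trying to get the Postgres full-text search facility working.

I have two tables: one I created just for testing, and the other is the real table I want to be able to search.

Test table: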

webarchive=# \d test_sites
                            Table "public.test_sites"
   Column    |   Type   |                        Modifiers
-------------+----------+---------------------------------------------------------
 id          | integer  | not null default nextval('test_sites_id_seq'::regclass)
 content     | text     |
 tsv_content | tsvector |
Indexes:
    "test_sites_pkey" PRIMARY KEY, btree (id)
    "idx_test_web_pages_content" gin (tsv_content)
Triggers:
    web_pages_testing_content_change_trigger AFTER INSERT OR UPDATE ON test_sites FOR EACH ROW EXECUTE PROCEDURE web_pages_testing_content_update_func()

The "real" table:

webarchive=# \d web_pages
                                      Table "public.web_pages"
    Column    |            Type             |                       Modifiers
--------------+-----------------------------+--------------------------------------------------------
 id           | integer                     | not null default nextval('web_pages_id_seq'::regclass)
 state        | dlstate_enum                | not null
 errno        | integer                     |
 url          | text                        | not null
 starturl     | text                        | not null
 netloc       | text                        | not null
 file         | integer                     |
 priority     | integer                     | not null
 distance     | integer                     | not null
 is_text      | boolean                     |
 limit_netloc | boolean                     |
 title        | citext                      |
 mimetype     | text                        |
 type         | itemtype_enum               |
 raw_content  | text                        |
 content      | text                        |
 fetchtime    | timestamp without time zone |
 addtime      | timestamp without time zone |
 tsv_content  | tsvector                    |
Indexes:
    "web_pages_pkey" PRIMARY KEY, btree (id)
    "ix_web_pages_url" UNIQUE, btree (url)
    "idx_web_pages_content" gin (tsv_content)
    "idx_web_pages_title" gin (to_tsvector('english'::regconfig, title::text))
    "ix_web_pages_distance" btree (distance)
    "ix_web_pages_distance_filtered" btree (priority) WHERE state = 'new'::dlstate_enum AND distance < 1000000
    "ix_web_pages_priority" btree (priority)
    "ix_web_pages_type" btree (type)
    "ix_web_pages_url_ops" btree (url text_pattern_ops)
Foreign-key constraints:
    "web_pages_file_fkey" FOREIGN KEY (file) REFERENCES web_files(id)
Triggers:
    web_pages_content_change_trigger AFTER INSERT OR UPDATE ON web_pages FOR EACH ROW EXECUTE PROCEDURE web_pages_content_update_func()

Extra bits aside, both tables have a content column and a tsv_content column with a gin() index on it. A trigger updates the tsv_content column every time the content column changes.
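The bodies of the trigger functions are not shown in the question. As a rough illustration of the setup described above, a trigger-maintained tsvector column on a fresh copy of the test table might look like the sketch below. The names ending in _sketch are hypothetical, the 'english' configuration is an assumption, and the sketch uses a BEFORE trigger (so it can assign to NEW directly) rather than the AFTER trigger listed in the table definition.

```sql
-- Sketch only: names ending in _sketch are hypothetical, and the
-- 'english' text search configuration is an assumption.
CREATE OR REPLACE FUNCTION tsv_content_update_sketch() RETURNS trigger AS $$
BEGIN
    -- to_tsvector() returns NULL for NULL input, so rows without
    -- content keep a NULL tsv_content.
    NEW.tsv_content := to_tsvector('english', NEW.content);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- A BEFORE row trigger may modify NEW before the row is stored.
CREATE TRIGGER tsv_content_update_sketch_trigger
    BEFORE INSERT OR UPDATE ON test_sites
    FOR EACH ROW EXECUTE PROCEDURE tsv_content_update_sketch();

-- GIN index on the precomputed column (index name taken from the \d output).
CREATE INDEX idx_test_web_pages_content ON test_sites USING gin (tsv_content);
```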

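Note that the other gin index works fine. I actually started out with a gin (to_tsvector('english'::regconfig, content::text)) index directly on the content column instead of a second column, but after waiting for that index to rebuild a few times during testing, I decided to use a separate column to pre-store the tsvector value.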
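For comparison, the expression-index approach mentioned above could be sketched as follows. The index name is hypothetical. With an expression index, the planner can only match a query whose WHERE clause repeats the indexed expression, including the explicit configuration argument.

```sql
-- Sketch only: the index name is hypothetical.
-- The two-argument to_tsvector(regconfig, text) form is required here,
-- because index expressions must be immutable.
CREATE INDEX idx_web_pages_content_expr_sketch
    ON web_pages USING gin (to_tsvector('english', content));

-- The query has to repeat the same expression for the index to be considered.
SELECT id
FROM web_pages
WHERE to_tsvector('english', content) @@ to_tsquery('english', 'testing');
```

A stored tsvector column avoids recomputing to_tsvector() when rows are rechecked or ranked, at the cost of extra storage and a trigger to keep it current.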

Running a query against the test table uses the index, as I would expect:

webarchive=# EXPLAIN ANALYZE SELECT
    test_sites.id,
    test_sites.content,
    ts_rank_cd(test_sites.tsv_content, to_tsquery($$testing$$)) AS ts_rank_cd_1
FROM
    test_sites
WHERE
    test_sites.tsv_content @@ to_tsquery($$testing$$);
                                                              QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on test_sites  (cost=16.45..114.96 rows=25 width=669) (actual time=0.175..3.720 rows=143 loops=1)
   Recheck Cond: (tsv_content @@ to_tsquery('testing'::text))
   Heap Blocks: exact=117
   ->  Bitmap Index Scan on idx_test_web_pages_content  (cost=0.00..16.44 rows=25 width=0) (actual time=0.109..0.109 rows=143 loops=1)
         Index Cond: (tsv_content @@ to_tsquery('testing'::text))
 Planning time: 0.414 ms
 Execution time: 3.800 ms
(7 rows)

But the exact same query against the real table never seems to produce anything but a plain old sequential scan:

webarchive=# EXPLAIN ANALYZE SELECT
       web_pages.id,
       web_pages.content,
       ts_rank_cd(web_pages.tsv_content, to_tsquery($$testing$$)) AS ts_rank_cd_1
   FROM
       web_pages
   WHERE
       web_pages.tsv_content @@ to_tsquery($$testing$$);
                                                       QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 Seq Scan on web_pages  (cost=0.00..4406819.80 rows=19751 width=505) (actual time=0.343..142325.954 rows=134949 loops=1)
   Filter: (tsv_content @@ to_tsquery('testing'::text))
   Rows Removed by Filter: 12764373
 Planning time: 0.436 ms
 Execution time: 142341.489 ms
(5 rows)

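I increased work_mem to 3 GB to see whether that was the issue, and it is not.

Additionally, note that these are fairly large tables: roughly 150 GB of text across 4 million rows (plus 8 million additional rows where content/tsv_content is NULL).

The test_sites table has 1/1000th of the rows of web_pages, because experimenting is a bit prohibitive when every query takes several minutes.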
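The question does not say how the test table was populated. One way to build a roughly 1/1000 sample, shown purely as an assumption, is to copy every thousandth row by id; the trigger on test_sites then fills in tsv_content.

```sql
-- Sketch only: the sampling rule (id % 1000 = 0) is an assumption,
-- not the method used in the question.
INSERT INTO test_sites (content)
SELECT content
FROM web_pages
WHERE content IS NOT NULL
  AND id % 1000 = 0;

-- Refresh planner statistics for the newly loaded table.
ANALYZE test_sites;
```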

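I'm using PostgreSQL 9.5 (yes, I compiled it myself; I wanted ON CONFLICT). There doesn't seem to be a tag for it yet.

I have read through the open issues for 9.5, and I can't see this being a result of any of them.

Fresh from a complete rebuild of the index, the problem still persists: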

webarchive=# ANALYZE web_pages ;
ANALYZE
webarchive=# EXPLAIN ANALYZE SELECT
    web_pages.id,
    web_pages.content,
    ts_rank_cd(web_pages.tsv_content, to_tsquery($$testing$$)) AS ts_rank_cd_1
FROM
    web_pages
WHERE
    web_pages.tsv_content @@ to_tsquery($$testing$$);
                                                              QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on web_pages  (cost=10000000000.00..10005252343.30 rows=25109 width=561) (actual time=7.114..146444.168 rows=134949 loops=1)
   Filter: (tsv_content @@ to_tsquery('testing'::text))
   Rows Removed by Filter: 13137318
 Planning time: 0.521 ms
 Execution time: 146465.188 ms
(5 rows)

Note that I literally just ANALYZEd, and seqscan is disabled.
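For readers reproducing this, the session settings mentioned in the question can be applied as in the sketch below. Setting enable_seqscan to off does not forbid sequential scans; it adds a very large cost penalty to them, which is why the plan above starts at cost=10000000000.00 and still chooses a Seq Scan when the planner finds no usable alternative.

```sql
-- Session-level settings only; nothing here is persisted.
SET work_mem = '3GB';
SET enable_seqscan = off;
SHOW enable_seqscan;

-- Plain EXPLAIN shows the chosen plan without running the query.
EXPLAIN
SELECT web_pages.id
FROM web_pages
WHERE web_pages.tsv_content @@ to_tsquery($$testing$$);

-- Restore the defaults afterwards.
RESET enable_seqscan;
RESET work_mem;
```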

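Solution

Well, I spent a bit of time freeing up space on the disk that holds the DB by moving some other databases to another SSD.

I then ran VACUUM ANALYZE across the entire database, and now it has apparently noticed that I have the index.

I had previously analyzed and vacuumed just this table, but apparently doing it across the board rather than on a specific table made some difference. Go figure.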

webarchive=# EXPLAIN ANALYZE SELECT
    web_pages.id,
    web_pages.content
FROM
    web_pages
WHERE
    web_pages.tsv_content @@ to_tsquery($$testing$$);
                                                                 QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on web_pages  (cost=1185.79..93687.30 rows=23941 width=189) (actual time=41.448..152.108 rows=134949 loops=1)
   Recheck Cond: (tsv_content @@ to_tsquery('testing'::text))
   Heap Blocks: exact=105166
   ->  Bitmap Index Scan on idx_web_pages_content  (cost=0.00..1179.81 rows=23941 width=0) (actual time=24.940..24.940 rows=134996 loops=1)
         Index Cond: (tsv_content @@ to_tsquery('testing'::text))
 Planning time: 0.452 ms
 Execution time: 154.942 ms
(7 rows)
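The two forms of the command contrasted above, along with a couple of catalog queries that show when a table was last vacuumed or analyzed and what size estimates the planner currently holds, can be sketched as follows. This only illustrates how to inspect the state; it does not explain why the database-wide run made the difference.

```sql
-- VACUUM cannot run inside a transaction block.
VACUUM ANALYZE web_pages;  -- a single table
VACUUM ANALYZE;            -- every table in the current database that the user may vacuum

-- When was the table last vacuumed/analyzed, manually or by autovacuum?
SELECT relname, last_vacuum, last_autovacuum,
       last_analyze, last_autoanalyze,
       n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'web_pages';

-- Size estimates the planner uses for the table and its GIN index.
SELECT relname, relpages, reltuples
FROM pg_class
WHERE relname IN ('web_pages', 'idx_web_pages_content');
```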

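I also took the opportunity to run VACUUM FULL; now that I have enough space for the processing. There has been quite a bit of row churn in the table while I experimented during development, and I'd like to consolidate any file fragmentation that resulted from it.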
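A minimal sketch for checking how much space such a rewrite reclaims is shown below. It targets the single table rather than the whole database, which is a simplification of what the answer ran. VACUUM FULL writes a complete new copy of the table and holds an ACCESS EXCLUSIVE lock while it runs, so it needs free disk space and blocks other access to the table.

```sql
-- Total on-disk size (table, indexes, TOAST) before the rewrite.
SELECT pg_size_pretty(pg_total_relation_size('web_pages')) AS size_before;

-- Rewrites the table and rebuilds its indexes.
VACUUM FULL web_pages;

-- Total on-disk size after the rewrite.
SELECT pg_size_pretty(pg_total_relation_size('web_pages')) AS size_after;
```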

License: CC-BY-SA with attribution

Not affiliated with dba.stackexchange