How can I improve the performance of this query on a table with millions of trigrams?

https://dba.stackexchange.com/questions/264712

27-02-2021
|

Question

In an effort to make searching across many string fields in certain tables faster, I've been attempting to use trigrams.

I've created a separate table to hold them, and a query to search against them (intended to be used in a table-valued-function).

CREATE TABLE [dbo].[SearchTrigramTwoFieldKey]
(
  [Ordinal]                   BIGINT          NOT NULL,
  [SearchCategoryId]          INTEGER         NOT NULL    CONSTRAINT [FK__SearchTrigramTwoFieldKey_SearchCategoryId_To_dbo.SearchCategory_Id]               FOREIGN KEY([SearchCategoryId])         REFERENCES [dbo].[SearchCategory]([Id]),
  [SearchCategoryColumnId]    INTEGER         NOT NULL    CONSTRAINT [FK__SearchTrigramTwoFieldKey_SearchCategoryColumnId_To_dbo.SearchCategoryColumn_Id]   FOREIGN KEY([SearchCategoryColumnId])   REFERENCES [dbo].[SearchCategoryColumn]([Id]),
  [TableId]                   INTEGER         NOT NULL    CONSTRAINT [FK__SearchTrigramTwoFieldKey_TableId_To_dbo.Table_Id]                                 FOREIGN KEY([TableId])                  REFERENCES [dbo].[Table]([Id]),
  [RecordId1]                 BIGINT          NOT NULL,
  [RecordId2]                 BIGINT          NOT NULL,
  [Trigram]                   NVARCHAR(3)     NOT NULL,
  [IsLastTrigram]             BIT             NOT NULL,
  [RecordColumnTrigramCount]  INTEGER         NOT NULL,

  CONSTRAINT [PK__SearchTrigramTwoFieldKey_SearchCategoryId_SearchCategoryColumnId_TableId_RecordId1_RecordId2_Ordinal]
      PRIMARY KEY
      (
          [SearchCategoryId]          ASC,
          [SearchCategoryColumnId]    ASC,
          [TableId]                   ASC,
          [RecordId1]                 ASC,
          [RecordId2]                 ASC,
          [Ordinal]                   ASC
      ),
)

CREATE UNIQUE NONCLUSTERED INDEX [UNCI__SearchTrigramTwoFieldKey_IsLastTrigram] ON [dbo].[SearchTrigramTwoFieldKey]
(
  [SearchCategoryId] ASC,
  [SearchCategoryColumnId] ASC,
  [TableId] ASC,
  [RecordId1] ASC,
  [RecordId2] ASC,
  [IsLastTrigram] ASC
)
WHERE ([IsLastTrigram]=(1))

The last two fields are an attempt to reduce the amount of calculation needing to be done in the search query against this table to try and speed up performance, with the index as a precaution against bad data.

After inserting all the trigrams there are approximately 60 million records in this table. This number will almost certainly increase over time.

To search it, I've written the following query:

--Setting up query parameters:
DECLARE @SearchCategoryId INTEGER = 3
DECLARE @SearchCategoryColumnIds AS TABLE([Value] INTEGER NOT NULL)
DECLARE @searchValues AS TABLE([Value] NVARCHAR(4000))


INSERT INTO @searchValues([Value])
VALUES('Land'), ('Ireland')

--The query itself:
SELECT  ROW_NUMBER() OVER (ORDER BY COUNT(CASE WHEN IsExactMatch = 1 THEN 1 END) DESC,
                                  COUNT(*) DESC,
                                  MIN(CASE WHEN IsExactMatch = 0 THEN MinMatchDistanceRowOrder END)) AS [MatchOrder],
      RecordId1,
      RecordId2
FROM
(
SELECT  RecordId1, RecordId2,
      IIF(MIN([T].T2Ordinal) = 1 AND MAX(CAST(T.T2IsLastTrigram AS INTEGER)) = 1, 1, 0) AS IsExactMatch,
      ROW_NUMBER() OVER (ORDER BY MIN(T.T2TrigramCount - T1TrigramCount)) AS MinMatchDistanceRowOrder,
      [SearchValue]
FROM
(SELECT T1.SearchValueNumber,
      T1.SearchValue,
      LAG(T1.Ordinal)         OVER (PARTITION BY T2.SearchCategoryId, T2.SearchCategoryColumnId, T2.TableId, T2.RecordId1, T2.RecordId2, T1.SearchValueNumber ORDER BY T2.SearchCategoryId, T2.SearchCategoryColumnId, T2.TableId, T2.RecordId1, T2.RecordId2, T1.Ordinal)
                                  AS T1OrdinalLag,
      T1.Ordinal          AS T1Ordinal,
      LEAD(T1.Ordinal)        OVER (PARTITION BY T2.SearchCategoryId, T2.SearchCategoryColumnId, T2.TableId, T2.RecordId1, T2.RecordId2, T1.SearchValueNumber ORDER BY T2.SearchCategoryId, T2.SearchCategoryColumnId, T2.TableId, T2.RecordId1, T2.RecordId2, T1.Ordinal)
                                  AS T1OrdinalLead,
      T1.NgramCount     AS T1TrigramCount,
      LAG(T2.Ordinal)         OVER (PARTITION BY T2.SearchCategoryId, T2.SearchCategoryColumnId, T2.TableId, T2.RecordId1, T2.RecordId2, T1.SearchValueNumber ORDER BY T2.SearchCategoryId, T2.SearchCategoryColumnId, T2.TableId, T2.RecordId1, T2.RecordId2, T2.Ordinal, T2.Trigram)
                                  AS T2OrdinalLag,
      T2.Ordinal          AS T2Ordinal,
      LEAD(T2.Ordinal)        OVER (PARTITION BY T2.SearchCategoryId, T2.SearchCategoryColumnId, T2.TableId, T2.RecordId1, T2.RecordId2, T1.SearchValueNumber ORDER BY T2.SearchCategoryId, T2.SearchCategoryColumnId, T2.TableId, T2.RecordId1, T2.RecordId2, T2.Ordinal, T2.Trigram)
                                  AS T2OrdinalLead,
      T2.IsLastTrigram    AS T2IsLastTrigram,
      MIN(T2.Ordinal)         OVER (PARTITION BY T2.SearchCategoryId, T2.SearchCategoryColumnId, T2.TableId, T2.RecordId1, T2.RecordId2, T1.SearchValueNumber)
                                  AS MinOrdinal,
      T2.RecordColumnTrigramCount  AS T2TrigramCount,
      T2.SearchCategoryId,
      T2.SearchCategoryColumnId,
      T2.TableId,
      T2.RecordId1,
      T2.RecordId2
FROM dbo.SearchTrigramTwoFieldKey AS T2
INNER JOIN
(
  SELECT [Value] FROM @SearchCategoryColumnIds
  UNION ALL
  SELECT NULL) AS scc ON NOT EXISTS(SELECT TOP 1 [Value] FROM @SearchCategoryColumnIds) OR T2.SearchCategoryColumnId = [Value]
INNER JOIN
(
  SELECT SearchValueNumber, SearchValue, ngrams.Ordinal, ngrams.Ngram, ngrams.IsLastNgram, ngrams.NgramCount
  FROM
  (
      SELECT  ROW_NUMBER() OVER (ORDER BY [Value]) AS SearchValueNumber, *
      FROM
      (
          SELECT DISTINCT [Value] AS SearchValue, *
          FROM @searchValues
      ) AS T
  ) AS [sv]
  CROSS APPLY dbo.fnGenerateNgrams([sv].[Value], DEFAULT) AS ngrams
) AS T1 ON T1.Ngram = T2.Trigram
WHERE T2.SearchCategoryId = @SearchCategoryId) AS T
WHERE
(
  (   T1OrdinalLead IS NULL OR T1OrdinalLead = T1Ordinal+1)
OR  (T1OrdinalLag IS NULL OR T1OrdinalLag = T1Ordinal-1)
)
AND
(
  (   T2OrdinalLead IS NULL OR T2OrdinalLead = T2Ordinal+1)
OR  (T2OrdinalLag IS NULL OR T2OrdinalLag = T2Ordinal-1)
)
AND T2TrigramCount >= T1TrigramCount
GROUP BY SearchCategoryId, SearchCategoryColumnId, TableId, RecordId1, RecordId2, [SearchValue]
HAVING COUNT(*) >= (SELECT TOP 1 NGramCount FROM dbo.fnGenerateNgrams([SearchValue], DEFAULT))
) AS T
GROUP BY RecordId1, RecordId2
HAVING COUNT(DISTINCT [SearchValue]) = (SELECT COUNT(DISTINCT [Value]) FROM @searchValues)
ORDER BY MatchOrder ASC
OPTION(RECOMPILE)

Some notes on the query:

Intended to take multiple search terms, generate the trigrams for those terms and match them against the trigrams in the trigram table.
- All specified search terms must be matched at least once per record that each collection of trigrams is for.
When broken down, the trigram order should be preserved so that correct matches are found.
Should return the MatchOrder so that we can order by the closest matches afterwards if needed.
- Matches are intended to be ordered by the number of exact matches, the number of

This query is the only one that will be querying this table. There will be data insertions and deletions every set period of time to refresh updated data, but the speed of those isn't of particular concern at the moment.

Execution times vary wildly depending on the search values specified, even if it's just single values (I've seen some as short as 6 seconds, and others take around 5 minutes for just two words), and I suspect (but am not certain) it's due to how much data matches some of the trigrams, even if they're not a complete match in the end.

From looking at the execution plan in SSMS and Plan Explorer, I believe it looks like it is sorts that are eating the time, but I'm unsure how to properly correct this with indexes.

These are the indexes I've created so far on the trigram table (in addition to its primary clustered index and the unique non-clustered index above) in an attempt to improve execution speed:

CREATE NONCLUSTERED INDEX [NCI__SearchTgramTwoFieldKey_SearchCategoryColumnId_TableId_RecordId1_RecordId2_Ordinal_IsLastTgram_RecordColumnTgramCount_Tgram] ON [dbo].[SearchTrigramTwoFieldKey]
(
    [SearchCategoryColumnId] ASC,
    [TableId] ASC,
    [RecordId1] ASC,
    [RecordId2] ASC,
    [Ordinal] ASC,
    [IsLastTrigram] ASC,
    [RecordColumnTrigramCount] ASC,
    [Trigram] ASC
)

CREATE NONCLUSTERED INDEX [NCI__SearchTrigramTwoFieldKey_SearchCategoryColumnId_TableId_RecordId1_RecordId2] ON [dbo].[SearchTrigramTwoFieldKey]
(
    [SearchCategoryColumnId] ASC,
    [TableId] ASC,
    [RecordId1] ASC,
    [RecordId2] ASC
)    

CREATE NONCLUSTERED INDEX [NCI__SearchTrigramTwoFieldKey_SearchCategoryColumnId_TableId_RecordId1_RecordId2_Ordinal] ON [dbo].[SearchTrigramTwoFieldKey]
(
    [SearchCategoryColumnId] ASC,
    [TableId] ASC,
    [RecordId1] ASC,
    [RecordId2] ASC,
    [Ordinal] ASC
)    

CREATE NONCLUSTERED INDEX [NCI__SearchTrigramTwoFieldKey_SearchCategoryId_Trigram__Include_IsLastTrigram_RecordColumnTrigramCount] ON [dbo].[SearchTrigramTwoFieldKey]
(
    [SearchCategoryId] ASC,
    [Trigram] ASC
)
INCLUDE (   [IsLastTrigram], RecordColumnTrigramCount])

Of these four indexes, only the last one is the one it recommended I create. The others are all experimental to try and improve performance.

Execution plan: https://www.brentozar.com/pastetheplan/?id=HyFZDlTDI

Despite my efforts, performance is still far from where I'd like. I want to try to speed up the execution time as much as possible, with the best case scenario having it to take less than a second for one or more search terms, but I don't know how feasible that is.

I lack sufficient knowledge of indexing to understand how to properly address this (assuming that indexing is the correct way to address this). I'm looking to learn what I can do to improve the performance here (and why it will improve performance), either through proper indexing or through improving the query whilst maintaining its functionality, if possible.

I've included the query and table-definition in case they reveal some ghastly (but correctable) inefficiency that I've not realised exists.

Solution

I don't think indexing is your (main) problem here.

There are some strange and troubling things related to timing in that execution plan. The first of which is a disparity between duration and CPU:

<QueryTimeStats CpuTime="93275" ElapsedTime="315874" />

The query ran for 5 minutes, but only used 1.5 minutes of CPU time (at DOP 1). This difference often means SQL Server is waiting on some shared resource, and not making progress running your query.

Some wait stats are captured in the execution plan:

<WaitStats>
  <Wait WaitType="RESOURCE_GOVERNOR_IDLE" WaitTimeMs="103626" WaitCount="35266" />
  <Wait WaitType="PAGELATCH_EX" WaitTimeMs="77512" WaitCount="2742411" />
  <Wait WaitType="PAGELATCH_SH" WaitTimeMs="66027" WaitCount="2037681" />
  <Wait WaitType="SOS_SCHEDULER_YIELD" WaitTimeMs="7798" WaitCount="2440" />
  <Wait WaitType="RESERVED_MEMORY_ALLOCATION_EXT" WaitTimeMs="41" WaitCount="38422" />
</WaitStats>

Resource Governor

There are over 103 seconds of RESOURCE_GOVERNOR_IDLE waits. Normally, I'd recommend that you check the server configuration and make sure that you are not being too-heavily capped as far as CPU allocation goes using a query like this:

SELECT 
    rgrp.[name],
    rgrp.min_cpu_percent,
    rgrp.max_cpu_percent, 
    rgrp.cap_cpu_percent
FROM sys.dm_resource_governor_resource_pools rgrp;

Since you are using Azure SQL Database, instead you'll need to upgrade to a tier with more compute. I noticed this in the plan XML as well:

NonParallelPlanReason="EstimatedDOPIsOne"

I think the smallest vCore options are 2, so this implies you're using one of the smallest DTU model offerings (less than S3).

Trying scaling up your database one tier at a time until you see the RESOURCE_GOVERNOR_IDLE waits reduce to a more acceptable level.

Note: this is likely contributing to the 7 seconds of SOS_SCHEDULER_YIELD as well.

Latch Waits

You also have 143 seconds of latch waits. Normally I would suspect this is some kind of tempdb contention, but there's not much evidence of tempdb usage in this query (there is one ~200 MB hash spill, and some small-ish spools).

Given the CPU cap issue, I suspect this unexpectedly high level of latch waits is related to the Azure service tier being used as well.

With the Waits

Subtracting the 246 seconds of waits discussed above, that drops the query runtime from 315 seconds to 69 seconds. It's still not amazing, but certainly better than 5 minutes. On a higher Azure service tier you may also benefit from parallel execution, further reducing the runtime.

Other approaches

A different trigram implementation

If you'd prefer not to "throw hardware" at this problem, and are interested in an alternate approach, Paul White has written a very performance-conscious trigram search function and shared it here: Trigram Wildcard String Search in SQL Server

Of course this is essentially changing your whole approach, so you'll have to weigh the cost vs benefit of scrapping / rewriting what you've done so far against bumping up the Azure costs.

Leveraging batch mode

Conor Cunningham suggested, as an experiment, to try leveraging batch mode - either using the new batch mode on rowstore, or by creating a columnstore index on the table:

...we added batch mode on rowstore in the more recent compat levels so please consider that - it will provide more benefit at higher DOPs, however. Also, a columnstore index may be an experiment to consider...

Licensed under: CC-BY-SA with attribution

Not affiliated with dba.stackexchange