سؤال

I have a large-ish data warehouse staging table (~2.5 million rows) that is using a mapping table to standardize a text column. I have an update statement that works perfectly, but has terrible performance, and would like some recommendations on modifications to the statement and/or alternate strategies for standardizing the data.

The source text can contain anything, and I'm basically looking for keywords in the strings. Because of the nature of the data, a longer string is more specific and therefore correct, so I want to get the longest mapping value found for each given row in the staging table.

Source data (simplified):

CREATE TABLE STAGINGTABLE (TextColumn VARCHAR(100));

INSERT INTO STAGINGTABLE VALUES ('This entry has a full foo bar entry');
INSERT INTO STAGINGTABLE VALUES ('Today was a foo day');
INSERT INTO STAGINGTABLE VALUES ('Only found bar data');
INSERT INTO STAGINGTABLE VALUES ('Blah blah blah');

CREATE TABLE MAPPING (SourceMap VARCHAR(100), TargetMap VARCHAR(100));

INSERT INTO MAPPING VALUES ('%FOO BAR%', 'Foo Bar');
INSERT INTO MAPPING VALUES ('%FOO%', 'Foo');
INSERT INTO MAPPING VALUES ('%BAR%', 'Bar');
INSERT INTO MAPPING VALUES ('%%', 'Unknown');

Update statement:

UPDATE STAGINGTABLE
SET TextColumn = (
    SELECT TargetMap
    FROM (
        SELECT TargetMap
        , ROW_NUMBER() OVER (PARTITION BY TextColumn
            ORDER BY LEN(SourceMap) DESC) AS ROWNUM
        FROM STAGINGTABLE
        INNER JOIN MAPPING
            ON UPPER(TextColumn) LIKE SourceMap
        WHERE TextColumn IS NOT NULL
    ) AS STG_MAP
    WHERE ROWNUM = 1
);

Final values in staging table after mapping:

Foo Bar
Foo
Bar
Unknown

Edit: the TextColumn and SourceMap columns do have non-clustered indexes.

هل كانت مفيدة؟

المحلول

Thanks to everyone's comments and suggestions, I tried all of the ideas @mustaccio posted, but in the end the fastest strategy was to abandon the mapping table altogether, and just go with a big ugly CASE statement to handle all of the situations, and insert that into another staging table. Execution time dropped to only a few seconds, and I can't argue with that. The situations to handle are mostly static, so if I have to re-deploy a SSIS package once or twice a year to add new situations if they come up, so be it.

Final example insert statement (not exact, just going off memory):

INSERT INTO STAGINGTABLE_TWO (TextColumn) 
SELECT 
  CASE
    WHEN TextColumn LIKE '%FOO BAR%' THEN 'Foo Bar'
    WHEN TextColumn LIKE '%FOO%' THEN 'Foo'
    WHEN TextColumn LIKE '%BAR%' THEN 'Bar'
    ELSE 'Unknown'
  END
FROM STAGINGTABLE;
مرخصة بموجب: CC-BY-SA مع الإسناد
لا تنتمي إلى dba.stackexchange
scroll top