Replace a sequential set of numbers with special character

https://dba.stackexchange.com/questions/262705

26-02-2021

Question

I have a varchar(200) column that contains entries such as:

ABC123124_A12312
ABC123_A1212
ABC123124_B12312
AC123124_AD12312
A12312_123
etc.

I want to replace a sequence of numbers with a single * so that I can group the distinct non-numeric patterns in the table.

The result for this set would be:

ABC*_A*
ABC*_B*
AC*_AD*
A*_*

I have written the primitive query below. It works correctly, but takes a long time to run on a huge table.

I need help rewriting or editing it to improve its performance. I am using SQL Server 2014.

-- 1. replace all numeric characters with '*'
-- 2. replace multiple consecutive '*' with just a single '*'
SELECT REPLACE
        (REPLACE
             (REPLACE
                  (REPLACE
                       (REPLACE
                            (REPLACE
                                 (REPLACE
                                      (REPLACE
                                           (REPLACE
                                                (REPLACE
                                                     (REPLACE
                                                          (REPLACE
                                                               (REPLACE(SampleID, '0', '*'),
                                                                '1', '*'),
                                                           '2', '*'),
                                                      '3', '*'),
                                                 '4', '*'),
                                            '5', '*'),
                                       '6', '*'),
                                  '7', '*'),
                             '8', '*'),
                        '9', '*')
                  , '*', '~*') -- replace each occurrence of '*' with '~*' (token plus asterisk)
             , '*~', '') -- replace in the result of the previous step each occurrence of '*~' (asterisk plus token) with '' (an empty string)
        , '~*', '*') -- replace in the result of the previous step each occurrence of '~*' (token plus asterisk) with '*' (asterisk)
        AS Pattern
FROM TABLE_X

Data

The column includes letters and numbers [A-Za-z0-9] and may also include the special characters / and _. I want to replace any sequence of numbers with *, but I do not know if the entry has special characters, and if so how many special characters.

I also do not know how many sequences of numbers are in the entry. All I know is that an entry must have a minimum of 1 number sequence.

Solution

Two factors are important for performance:

1. Reduce the number of string operations.

   You may find it is possible to implement what you need using e.g. CHARINDEX and PATINDEX to find the start and end of groups, rather than performing very many REPLACE operations on the whole string each time.

2. Use the cheapest collation that provides correct results.

   Binary collations are the cheapest. SQL collations (on non-Unicode data only) are a little more expensive. Windows collations are much more expensive.

For example:

DECLARE @T table
(
    SampleID varchar(200) NOT NULL UNIQUE
);

INSERT @T
    (SampleID)
VALUES
    ('ABC123124_A12312'),
    ('ABC123_A1212'),
    ('ABC123124_B12312'),
    ('AC123124_AD12312'),
    ('A12312_123'),
    ('999ABC888DEF');

SELECT
    T.SampleID,
    Pattern =
    (
        SELECT
            CASE
                WHEN Chars.this NOT LIKE '[0123456789]' THEN Chars.this
                WHEN Chars.prev NOT LIKE '[0123456789]' THEN '*'
                ELSE ''
            END
        FROM dbo.Numbers AS N
        OUTER APPLY
        (
            SELECT 
                SUBSTRING(Bin.string, N.n, 1),
                SUBSTRING(Bin.string, N.n + 1, 1)
        ) AS Chars (prev, this)
        WHERE
            N.n BETWEEN 1 AND LEN(Bin.string)
        ORDER BY N.n
        FOR XML PATH ('')
    )
FROM @T AS T
OUTER APPLY (VALUES('$' + T.SampleID COLLATE Latin1_General_100_BIN2)) AS Bin (string);


That example relies on a permanent table of numbers. If needed, a table sufficient for varchar(200) is:

-- Create a numbers table 1-200 using Itzik Ben-Gan's row generator
WITH
  L0   AS (SELECT 1 AS c UNION ALL SELECT 1),
  L1   AS (SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),
  L2   AS (SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),
  L3   AS (SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),
  L4   AS (SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),
  L5   AS (SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),
  Nums AS (SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS n FROM L5)
SELECT
    -- Destination column type integer NOT NULL
    ISNULL(CONVERT(integer, N.n), 0) AS n
INTO dbo.Numbers
FROM Nums AS N
WHERE N.n >= 1
AND N.n <= 200
OPTION (MAXDOP 1);

-- Add clustered primary key
ALTER TABLE dbo.Numbers
ADD CONSTRAINT PK_Numbers_n
PRIMARY KEY CLUSTERED (n)
WITH (SORT_IN_TEMPDB = ON, MAXDOP = 1, FILLFACTOR = 100);

If that isn't faster, you might find that using a binary collation alone would speed up your existing implementation sufficiently. To implement that, change one line of your code to:

(REPLACE(SampleID COLLATE Latin1_General_100_BIN2, '0', '*'),
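
For context, here is a sketch of how the whole statement might look with that one-line change applied, wrapped in a GROUP BY to count the distinct patterns (the stated goal of the question). The table variable @T, the alias P, and the Occurrences column are illustrative assumptions; @T stands in for TABLE_X. As in the original query, the collapse step assumes the data never contains '~' or '*'.

```sql
-- Sample data from the question; @T stands in for TABLE_X (assumption)
DECLARE @T table (SampleID varchar(200) NOT NULL);

INSERT @T (SampleID)
VALUES ('ABC123124_A12312'), ('ABC123_A1212'), ('ABC123124_B12312'),
       ('AC123124_AD12312'), ('A12312_123');

SELECT
    P.Pattern,
    COUNT(*) AS Occurrences
FROM @T AS T
CROSS APPLY
(
    SELECT
        REPLACE(REPLACE(REPLACE(
        REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
        REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
            -- The binary collation applies to every REPLACE in the chain
            T.SampleID COLLATE Latin1_General_100_BIN2,
            '0', '*'), '1', '*'), '2', '*'), '3', '*'), '4', '*'),
            '5', '*'), '6', '*'), '7', '*'), '8', '*'), '9', '*'),
            '*', '~*'),  -- prefix every asterisk with a token
            '*~', ''),   -- remove asterisk+token pairs, collapsing runs
            '~*', '*')   -- restore the single remaining asterisk
) AS P (Pattern)
GROUP BY P.Pattern;
```

With the five sample rows, 'ABC*_A*' should be reported twice and each of the other three patterns once.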

Users of SQL Server 2017 or later can leverage the built-in TRANSLATE function, which may perform better than the nested REPLACE calls.
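
A minimal sketch of the TRANSLATE variant, assuming SQL Server 2017 or later and data that never contains '~' or '*'. TRANSLATE maps all ten digits to '*' in a single call, and the three outer REPLACE calls are the same run-collapsing trick used in the question. The table variable @T is sample data only, and this has not been benchmarked against the nested REPLACE version.

```sql
-- Requires SQL Server 2017 or later (TRANSLATE is not available in 2014)
DECLARE @T table (SampleID varchar(200) NOT NULL);

INSERT @T (SampleID)
VALUES ('ABC123124_A12312'), ('A12312_123'), ('999ABC888DEF');

SELECT
    T.SampleID,
    Pattern =
        REPLACE(
            REPLACE(
                REPLACE(
                    -- One pass: every digit becomes '*'
                    TRANSLATE(
                        T.SampleID COLLATE Latin1_General_100_BIN2,
                        '0123456789',
                        '**********'),
                    '*', '~*'),  -- prefix every asterisk with a token
                '*~', ''),       -- remove asterisk+token pairs
            '~*', '*')           -- restore the single remaining asterisk
FROM @T AS T;
-- 'ABC123124_A12312' -> 'ABC*_A*'
-- 'A12312_123'       -> 'A*_*'
-- '999ABC888DEF'     -> '*ABC*DEF'
```

The second and third arguments of TRANSLATE must be the same length, which is why the replacement string repeats '*' ten times.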

You could also use a general regex CLR function, or implement something custom in SQLCLR for this particular task. See, for example, the question "SQL Server: Replace with wildcards?".

Using the SQL# library, a complete solution would be:

SELECT 
    T.SampleID,
    SQL#.RegEx_Replace4k(T.SampleID, '\d+', '*', -1, 1, 'CultureInvariant')
FROM @T AS T;

Full regex support is overkill for this task, so if you are able to use SQLCLR, coding a specific function for your needs would probably be the best performing solution of all.

OTHER TIPS

Create a numbers table in any manner you like, for example:

CREATE TABLE tblnumber (number int NOT NULL);

INSERT INTO tblnumber (number)
SELECT ROW_NUMBER() OVER (ORDER BY a.number)
FROM master..spt_values AS a
CROSS JOIN master..spt_values AS b;

CREATE UNIQUE CLUSTERED INDEX CI_num ON tblnumber (number);

Alternatively, populate tblNumber with only 2,000 or 3,000 numbers, since no string is going to be longer than that. It is better to keep the numbers table short.
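
One way to build such a shorter table is to cap the row count with TOP. This is a sketch only: the limit of 4000 is an assumption chosen to match a varchar(4000) input, and it assumes dbo.tblnumber does not already exist.

```sql
-- Assumes dbo.tblnumber does not already exist
CREATE TABLE dbo.tblnumber (number int NOT NULL);

-- TOP (4000) stops the cross join early instead of numbering every row pair
INSERT INTO dbo.tblnumber (number)
SELECT TOP (4000)
    ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM master..spt_values AS a
CROSS JOIN master..spt_values AS b;

CREATE UNIQUE CLUSTERED INDEX CI_num ON dbo.tblnumber (number);
```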

Then use an inline table-valued function (ITVF):

CREATE FUNCTION [dbo].[fn_Mask] (@pString VARCHAR(4000))
    --WARNING!!! DO NOT USE MAX DATA-TYPES HERE!  IT WILL KILL PERFORMANCE!
RETURNS TABLE
    WITH SCHEMABINDING
AS
RETURN
WITH CTE AS (
        SELECT t.number AS N
            ,SUBSTRING(@pString, t.number, 1) col
        FROM dbo.tblNumber T
        WHERE t.number <= DATALENGTH(@pString)
        )
    ,CTE1 AS (
        SELECT c.N
            ,CASE 
                WHEN patindex('%[0-9]%', c.col) = 0
                    THEN c.col
                ELSE oa.col2
                END col1
        FROM CTE c
        OUTER APPLY (
            SELECT TOP 1 '*' AS col2
            FROM CTE c1
            WHERE c.N - c1.N = 1
                AND patindex('%[0-9]%', c1.col) = 0
                AND patindex('%[0-9]%', c.col) = 1
            ORDER BY c1.N
            ) oa
        )

SELECT TOP 1 (
        SELECT '' + col1
        FROM CTE1
        WHERE N > 1
            AND col1 IS NOT NULL
        ORDER BY N
        FOR XML path('')
        ) MaskedString
FROM CTE1 C;

Usage:

DECLARE @T table
(
    SampleID varchar(200) NOT NULL UNIQUE
);

INSERT @T
    (SampleID)
VALUES
    ('ABC123124_A12312'),
    ('ABC123_A1212'),
    ('ABC123124_B12312'),
    ('AC123124_AD12312'),
    ('A12312_123'),
    ('A$B.C-D+E'),
    ('A2B.C-D+E'),
    ('999ABC888DEF');

-- Prefix one extra non-numeric character; it does not change the output
SELECT T.SampleID, ca.MaskedString
FROM @T AS T
CROSS APPLY (SELECT MaskedString FROM [dbo].[fn_Mask]('F' + T.SampleID)) AS ca;

select MaskedString from [dbo].[fn_Mask]('F'+'999ABC888DEF')

Licensed under: CC-BY-SA with attribution