Replace a sequential set of numbers with special character
Question
I have a varchar(200) column that contains entries such as,
ABC123124_A12312
ABC123_A1212
ABC123124_B12312
AC123124_AD12312
A12312_123
etc..
I want to replace a sequence of numbers with a single *
so that I can group the distinct non-numeric patterns in the table.
The result for this set would be
ABC*_A*
ABC*_B*
AC*_AD*
A*_*
I have written the following primitive query below, it works correctly, but takes a long time to run on a huge table.
I need help with rewriting or editing it to improve it's performance. SQL Server 2014
-- 1. replace all numeric characters with '*'
-- 2. replace multiple consecutive '*' with just a single '*'
SELECT REPLACE
(REPLACE
(REPLACE
(REPLACE
(REPLACE
(REPLACE
(REPLACE
(REPLACE
(REPLACE
(REPLACE
(REPLACE
(REPLACE
(REPLACE(SampleID, '0', '*'),
'1', '*'),
'2', '*'),
'3', '*'),
'4', '*'),
'5', '*'),
'6', '*'),
'7', '*'),
'8', '*'),
'9', '*')
, '*', '~*') -- replace each occurrence of '*' with '~*' (token plus asterisk)
, '*~', '') -- replace in the result of the previous step each occurrence of '*~' (asterisk plus token) with '' (an empty string)
, '~*', '*') -- replace in the result of the previous step each occurrence of '~*' (token plus asterisk) with '*' (asterisk)
AS Pattern
FROM TABLE_X
Data
The column includes letters and numbers [A-Za-z0-9]
and may also include the special characters /
and _
. I want to replace any sequence of numbers with *
, but I do not know if the entry has special characters, and if so how many special characters.
I also do not know how many sequences of numbers are in the entry. All I know is that an entry must have a minimum of 1 number sequence.
Solution
Two factors are important for performance:
Reduce the number of string operations.
You may find it is possible to implement what you need using e.g.
CHARINDEX
andPATINDEX
to find the start and end of groups, rather than performing very manyREPLACE
operations on the whole string each time.Use the cheapest collation that provides correct results.
Binary collations are the cheapest. SQL collations (on non-Unicode data only) are a little more expensive. Windows collations are much more expensive.
For example:
DECLARE @T table
(
SampleID varchar(200) NOT NULL UNIQUE
);
INSERT @T
(SampleID)
VALUES
('ABC123124_A12312'),
('ABC123_A1212'),
('ABC123124_B12312'),
('AC123124_AD12312'),
('A12312_123'),
('999ABC888DEF');
SELECT
T.SampleID,
Pattern =
(
SELECT
CASE
WHEN Chars.this NOT LIKE '[0123456789]' THEN Chars.this
WHEN Chars.prev NOT LIKE '[0123456789]' THEN '*'
ELSE ''
END
FROM dbo.Numbers AS N
OUTER APPLY
(
SELECT
SUBSTRING(Bin.string, N.n, 1),
SUBSTRING(Bin.string, N.n + 1, 1)
) AS Chars (prev, this)
WHERE
N.n BETWEEN 1 AND LEN(Bin.string)
ORDER BY N.n
FOR XML PATH ('')
)
FROM @T AS T
OUTER APPLY (VALUES('$' + T.SampleID COLLATE Latin1_General_100_BIN2)) AS Bin (string);
That example relies on a permanent table of numbers. If needed, a table sufficient for varchar(200)
is:
-- Create a numbers table 1-200 using Itzik Ben-Gan's row generator
WITH
L0 AS (SELECT 1 AS c UNION ALL SELECT 1),
L1 AS (SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),
L2 AS (SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),
L3 AS (SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),
L4 AS (SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),
L5 AS (SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),
Nums AS (SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS n FROM L5)
SELECT
-- Destination column type integer NOT NULL
ISNULL(CONVERT(integer, N.n), 0) AS n
INTO dbo.Numbers
FROM Nums AS N
WHERE N.n >= 1
AND N.n <= 200
OPTION (MAXDOP 1);
-- Add clustered primary key
ALTER TABLE dbo.Numbers
ADD CONSTRAINT PK_Numbers_n
PRIMARY KEY CLUSTERED (n)
WITH (SORT_IN_TEMPDB = ON, MAXDOP = 1, FILLFACTOR = 100);
If that isn't faster, you might find that using a binary collation alone would speed up your existing implementation sufficiently. To implement that, change one line of your code to:
(REPLACE(SampleID COLLATE Latin1_General_100_BIN2, '0', '*'),
Users of SQL Server 2017 or later can leverage the built-in TRANSLATE
function, which may perform better than the nested REPLACE
calls.
You could also use a general regex CLR function, or implement something custom in SQLCLR for this particular task. See for example SQL Server: Replace with wildcards?
Using the SQL# library, a complete solution would be:
SELECT
T.SampleID,
SQL#.RegEx_Replace4k(T.SampleID, '\d+', '*', -1, 1, 'CultureInvariant')
FROM @T AS T;
Full regex support is overkill for this task, so if you are able to use SQLCLR, coding a specific function for your needs would probably be the best performing solution of all.
OTHER TIPS
Create Number table in any manner you like,
create table tblnumber(number int not null)
insert into tblnumber (number)
select ROW_NUMBER()over(order by a.number) from master..spt_values a
, master..spt_values b
CREATE unique clustered index CI_num on tblnumber(number)
Or you can 2000 ,3000
number only only in tblNumber only since no string is going to be that long. Keep one Number table shorter.
Use ITVF,
ALTER FUNCTION [dbo].[fn_Mask] (@pString VARCHAR(4000))
--WARNING!!! DO NOT USE MAX DATA-TYPES HERE! IT WILL KILL PERFORMANCE!
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
WITH CTE AS (
SELECT t.number AS N
,SUBSTRING(@pString, t.number, 1) col
FROM dbo.tblNumber T
WHERE t.number <= DATALENGTH(@pString)
)
,CTE1 AS (
SELECT c.N
,CASE
WHEN patindex('%[0-9]%', c.col) = 0
THEN c.col
ELSE oa.col2
END col1
FROM CTE c
OUTER APPLY (
SELECT TOP 1 '*' AS col2
FROM CTE c1
WHERE c.N - c1.N = 1
AND patindex('%[0-9]%', c1.col) = 0
AND patindex('%[0-9]%', c.col) = 1
ORDER BY c1.N
) oa
)
SELECT TOP 1 (
SELECT '' + col1
FROM CTE1
WHERE N > 1
AND col1 IS NOT NULL
ORDER BY N
FOR XML path('')
) MaskedString
FROM CTE1 C;
Usage :
DECLARE @T table
(
SampleID varchar(200) NOT NULL UNIQUE
);
INSERT @T
(SampleID)
VALUES
('ABC123124_A12312'),
('ABC123_A1212'),
('ABC123124_B12312'),
('AC123124_AD12312'),
('A12312_123'),
('A$B.C-D+E'),
('A2B.C-D+E'),
('999ABC888DEF');
--Prefix one extra non numeric charector.it do not change output
select SampleID,MaskedString from @T T
cross apply(select MaskedString from [dbo].[fn_Mask]('F'+t.SampleID))ca
select MaskedString from [dbo].[fn_Mask]('F'+'999ABC888DEF')