Dynamic length obfuscation method in T-SQL
02-10-2020
Question
I would like to obfuscate (scramble) sensitive data in a SQL Server database in a way that provides:
- irreversibility (the plaintext can't be derived from the obfuscated data),
- the same length: the obfuscated data must be exactly as long as the data before obfuscation,
- no uniqueness requirement: repeated obfuscations of the same input value do not need to produce different values. To be honest, I would rather get the same value for the same input, since that can be useful (e.g. for matching data across different tables, probably in test cases).
Example:
Abc -> zyx (length: 3)
StackOverflow -> a65vr4doqjd (length: 11)
Usually I avoid "home made" algorithms, so are you aware of a built-in Microsoft solution that could provide this kind of obfuscation?
I hope I have expressed my problem clearly; otherwise let me know and I'll add as much info as needed.
Solution
No, I am not aware of any built-in function that does exactly this. But you can still accomplish this without doing anything too complicated.
You could use the built-in CRYPT_GEN_RANDOM function (introduced in SQL Server 2008), which generates the requested number of random bytes. The output is binary, and converting it to a string with style 2 represents each byte as two hexadecimal characters, so only about half as many bytes as input characters are needed (hence the / 2 + 1 part below: the + 1 covers odd lengths, and SUBSTRING trims any excess).
DECLARE @InputString NVARCHAR(4000) = 'hello';
-- Generate enough random bytes, render them as hex (style 2 = no "0x" prefix),
-- then trim the hex string to the length of the input.
SELECT SUBSTRING(CONVERT(VARCHAR(8000),
                         CRYPT_GEN_RANDOM((LEN(@InputString) / 2) + 1),
                         2),
                 1,
                 LEN(@InputString)) AS [Obfuscated];

SET @InputString = 'test';
SELECT SUBSTRING(CONVERT(VARCHAR(8000),
                         CRYPT_GEN_RANDOM((LEN(@InputString) / 2) + 1),
                         2),
                 1,
                 LEN(@InputString)) AS [Obfuscated];
Returns something along the lines of:
8C108
9A7A
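To make the / 2 + 1 arithmetic concrete, here is a minimal sketch that only inspects lengths (the variable name @Bytes is mine, used purely for illustration). It shows that 3 random bytes become 6 hex characters, and that integer division yields 3 bytes for both a 5-character and a 4-character input:

```sql
-- Illustration only: @Bytes is a hypothetical variable name.
DECLARE @Bytes VARBINARY(8000) = CRYPT_GEN_RANDOM(3);

SELECT DATALENGTH(@Bytes)                     AS [ByteCount],     -- 3
       LEN(CONVERT(VARCHAR(8000), @Bytes, 2)) AS [HexCharCount],  -- 6
       (5 / 2) + 1                            AS [BytesForLen5],  -- 3 (integer division)
       (4 / 2) + 1                            AS [BytesForLen4];  -- 3
```

In both cases the 6 hex characters are more than needed, which is why the outer SUBSTRING trims the result to the input length.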
The only real downside here is that this needs to be done inline, as CRYPT_GEN_RANDOM cannot be used in a User-Defined Function (UDF: Scalar or Table-Valued). However, it can still be applied in a set-based approach using a CTE, as shown here (just set @MaxLength to the max length of the column being obfuscated):
DECLARE @MaxLength INT = 10;
;WITH cte AS
(
    SELECT CONVERT(VARCHAR(8000),
                   CRYPT_GEN_RANDOM((@MaxLength / 2) + 1),
                   2) AS [Random]
)
SELECT tmp.[String],
       cte.[Random],
       SUBSTRING(cte.[Random], 1, LEN(tmp.[String])) AS [Obfuscated]
FROM   (VALUES (N'test'), (N'Hello')) tmp(String)
CROSS JOIN cte;
Returns something along the lines of:
String  Random        Obfuscated
------  ------------  ----------
test    F99B3888F993  F99B
Hello   D3250E74F0A3  D3250
As you can see, CRYPT_GEN_RANDOM returns a different value for each row.
Also, I'm not sure whether this is acceptable, but the only alpha characters returned are A - F.
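Since the expression has to be written inline anyway, it can also go straight into an UPDATE. The following is only a sketch under assumptions: the table variable @Customers and its columns are hypothetical stand-ins for whatever table holds the sensitive column.

```sql
-- Hypothetical sample data; substitute the real table and column.
DECLARE @Customers TABLE
(
    CustomerId INT PRIMARY KEY,
    LastName   VARCHAR(50) NOT NULL
);

INSERT INTO @Customers (CustomerId, LastName)
VALUES (1, 'Smith'), (2, 'Nguyen'), (3, 'Smith');

-- Overwrite each value with random hex characters of the same length.
UPDATE @Customers
SET    LastName = SUBSTRING(CONVERT(VARCHAR(8000),
                                    CRYPT_GEN_RANDOM((LEN(LastName) / 2) + 1),
                                    2),
                            1,
                            LEN(LastName));

SELECT CustomerId,
       LastName,
       LEN(LastName) AS [Length]
FROM   @Customers;
```

Because the values are random, the two rows that both held 'Smith' will almost certainly end up with different values, so this variant is not suitable when obfuscated values need to match across rows or tables. Also note that LEN ignores trailing spaces, so values that end in spaces would come out shorter.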
OR, if you want the obfuscation to be repeatable for the same input value, or at least don't mind it being repeatable and prefer that this code be in a function so that it is easier to apply to multiple columns, you can use the HASHBYTES function which, like CRYPT_GEN_RANDOM, returns binary bytes that convert to hex characters. Unlike CRYPT_GEN_RANDOM, the output length is fixed (in this case at 64 characters, since I am using SHA2_256), so I used REPLICATE to repeat the hashed value if the input string is longer than 64 characters. Also unlike CRYPT_GEN_RANDOM, HASHBYTES can be used in a User-Defined Function (UDF) :-).
CREATE FUNCTION dbo.Obfuscate(@InputString NVARCHAR(4000))
RETURNS TABLE
WITH SCHEMABINDING
AS RETURN
    SELECT SUBSTRING(REPLICATE(CONVERT(VARCHAR(8000),
                                       HASHBYTES('SHA2_256', @InputString),
                                       2),
                               (LEN(@InputString) / 64) + 1),
                     1,
                     LEN(@InputString)) AS [Obfuscated];
GO
And that can be used as follows:
SELECT tmp.[String],
       LEN(tmp.[String]) AS [InputLength],
       ob.[Obfuscated],
       LEN(ob.[Obfuscated]) AS [OutputLength]
FROM   (VALUES (N'test'), (N'Hello'), (REPLICATE(N'A', 63)),
               (REPLICATE(N'B', 64)), (REPLICATE(N'C', 65)),
               (REPLICATE(N'D', 4000))) tmp(String)
CROSS APPLY dbo.Obfuscate(tmp.[String]) ob;
Returns something along the lines of:
String                     InputLength  Obfuscated                  OutputLength
-------------------------  -----------  --------------------------  ------------
test                       4            FE52                        4
Hello                      5            A07E4                       5
AAAAAAAAAAAAAAAAAAAAAA...  63           4B589C85DE74E76487730F3...  63
BBBBBBBBBBBBBBBBBBBBBB...  64           79813FB6480F354F1C6017A...  64
CCCCCCCCCCCCCCCCCCCCCC...  65           FB4B38FBA41ECC24B5B0F68...  65
DDDDDDDDDDDDDDDDDDDDDD...  4000         5D01CC6508C164E652B5C77...  4000
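Because the hash is deterministic, the same input always yields the same obfuscated value, which is what makes matching across tables possible. Here is a rough sketch of that use case; it assumes the dbo.Obfuscate function above has been created, and the table variables @Orders and @Invoices are hypothetical examples of mine.

```sql
-- Hypothetical tables that share a sensitive column.
DECLARE @Orders   TABLE (CustomerName NVARCHAR(50) NOT NULL);
DECLARE @Invoices TABLE (CustomerName NVARCHAR(50) NOT NULL);

INSERT INTO @Orders   (CustomerName) VALUES (N'Alice Johnson'), (N'Bob Stone');
INSERT INTO @Invoices (CustomerName) VALUES (N'Alice Johnson'), (N'Carol White');

-- Obfuscate both sides independently, then join on the obfuscated values.
;WITH o AS
(
    SELECT ob.[Obfuscated]
    FROM   @Orders t
    CROSS APPLY dbo.Obfuscate(t.CustomerName) ob
),
i AS
(
    SELECT ob.[Obfuscated]
    FROM   @Invoices t
    CROSS APPLY dbo.Obfuscate(t.CustomerName) ob
)
SELECT o.[Obfuscated] AS [OrderCustomer],
       i.[Obfuscated] AS [InvoiceCustomer]
FROM   o
INNER JOIN i
        ON i.[Obfuscated] = o.[Obfuscated];
```

The row for N'Alice Johnson' should match on both sides. Keep in mind that the hash is computed over the exact characters, so values differing only in case hash differently, and that very short inputs keep only a few hex characters, so distinct short values can collide.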
PLEASE NOTE: If you need alpha characters beyond A - F and/or need to have distinct obfuscated values for distinct input values (i.e. reduce chances of collisions), then either method above can be adapted easily enough to do that.
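As one possible way to get characters beyond A - F (a sketch of my own, not a built-in feature), each byte of the hash can be mapped onto a larger alphabet using modulo. The variable names and the 36-character alphabet below are assumptions chosen for illustration:

```sql
-- All names and the alphabet here are hypothetical.
DECLARE @InputString NVARCHAR(4000) = N'StackOverflow';
DECLARE @Alphabet    VARCHAR(36)    = 'abcdefghijklmnopqrstuvwxyz0123456789';
DECLARE @Hash        VARBINARY(32)  = HASHBYTES('SHA2_256', @InputString);
DECLARE @Result      VARCHAR(4000)  = '';
DECLARE @i           INT            = 0;

WHILE @i < LEN(@InputString)
BEGIN
    -- Take one hash byte (0-255), wrapping around after 32 bytes,
    -- and use it modulo 36 as a 1-based index into the alphabet.
    SET @Result += SUBSTRING(@Alphabet,
                             (CONVERT(INT, SUBSTRING(@Hash, (@i % 32) + 1, 1)) % 36) + 1,
                             1);
    SET @i += 1;
END;

SELECT @Result      AS [Obfuscated],
       LEN(@Result) AS [OutputLength];  -- 13, same as the input
```

This stays repeatable for the same input. The trade-offs are that the output pattern repeats every 32 characters for long inputs, that 256 is not a multiple of 36 so some characters are slightly more likely than others, and that the WHILE loop processes a single value, so a set-based version would need a numbers table or similar.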