Question

I would like to obfuscate (scramble) sensitive data from a SQL Server database in a way that provides:

  • irreversibility (the plaintext can't be derived from the obfuscated data),
  • the same length: the obfuscated data must be as long as the data was before obfuscation,
  • no uniqueness requirement: the obfuscated value does not need to differ between repeated obfuscations of the same input value. To be honest, I would rather get the same value for the same input, since that can be useful (e.g. for matching data in different tables, probably in test cases).

Example:

Abc -> zyx (length: 3)
StackOverflow -> a65vr4doqjd (length: 11)

Usually I avoid "home-made" algorithms, so are you aware of a built-in Microsoft solution that could provide this kind of obfuscation?

I hope I expressed my problem clearly; otherwise let me know and I'll try to add as much info as needed.

Solution

No, I am not aware of any built-in function that does exactly this. But you can still accomplish it without doing anything too complicated.

You could use the built-in CRYPT_GEN_RANDOM function (introduced in SQL Server 2008 R2), which returns cryptographically random bytes of a supplied length. The result is VARBINARY, and converting it to a string with style 2 represents each byte as two hexadecimal characters (hence the / 2 + 1 part below).

DECLARE @InputString NVARCHAR(4000) = 'hello';
SELECT SUBSTRING(CONVERT(VARCHAR(8000),
                         CRYPT_GEN_RANDOM((LEN(@InputString) / 2) + 1),
                         2),
                 1,
                 LEN(@InputString)) AS [Obfuscated];

SET @InputString = 'test';
SELECT SUBSTRING(CONVERT(VARCHAR(8000),
                         CRYPT_GEN_RANDOM((LEN(@InputString) / 2) + 1),
                         2),
                 1,
                 LEN(@InputString)) AS [Obfuscated];

Returns something along the lines of:

8C108

9A7A
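
To see where the / 2 + 1 comes from, it can help to run the same conversion on a fixed value instead of random bytes. This is only a small sketch; the literal 0x8C108A is an arbitrary example value, not something from the queries above:

```sql
-- Style 2 renders each byte as two hex characters with no '0x' prefix;
-- style 1 keeps the prefix, which would otherwise end up in the output.
SELECT CONVERT(VARCHAR(8000), 0x8C108A, 2) AS [Style2],          -- 8C108A
       CONVERT(VARCHAR(8000), 0x8C108A, 1) AS [Style1],          -- 0x8C108A
       (LEN(N'hello') / 2) + 1             AS [BytesRequested];  -- 3 bytes = 6 hex characters
```

Because the division is integer division, the + 1 guarantees at least as many hex characters as the input has characters, and SUBSTRING then trims any extra one.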

The only real downside here is that this needs to be done inline as CRYPT_GEN_RANDOM cannot be used in a User-Defined Function (UDF: Scalar or Table-Valued). However, it can still be applied in a set-based approach using a CTE as shown here (just set @MaxLength to the max length of the column being obfuscated):

DECLARE @MaxLength INT = 10;

;WITH cte AS
(
    SELECT CONVERT(VARCHAR(8000),
                   CRYPT_GEN_RANDOM((@MaxLength / 2) + 1),
                   2) AS [Random]
)
SELECT tmp.[String],
       cte.[Random],
       SUBSTRING(cte.[Random], 1, LEN(tmp.[String])) AS [Obfuscated]
FROM   (VALUES (N'test'), (N'Hello')) tmp(String)
CROSS JOIN  cte;

Returns something along the lines of:

String    Random          Obfuscated
------    ------------    ----------
test      F99B3888F993    F99B
Hello     D3250E74F0A3    D3250

As you can see, CRYPT_GEN_RANDOM returns a different value for each row.

Also note that, because the output is hexadecimal, the only characters returned are 0 - 9 and A - F. I am not sure whether that is acceptable for your purposes.
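
If the goal is to overwrite a column in place, the same inline expression can probably be used directly in an UPDATE. The following is a minimal sketch under that assumption; the temp table #People and its columns are hypothetical and exist only for the illustration:

```sql
-- Hypothetical table, used only for this illustration.
CREATE TABLE #People (PersonID INT IDENTITY(1, 1) PRIMARY KEY,
                      LastName VARCHAR(50) NOT NULL);

INSERT INTO #People (LastName) VALUES ('Smith'), ('Jones'), ('Smith');

-- CRYPT_GEN_RANDOM is evaluated per row here, so the two 'Smith' rows
-- will almost certainly receive different values.
UPDATE #People
SET    LastName = SUBSTRING(CONVERT(VARCHAR(8000),
                                    CRYPT_GEN_RANDOM((LEN(LastName) / 2) + 1),
                                    2),
                            1,
                            LEN(LastName));

SELECT PersonID, LastName, LEN(LastName) AS [Length]
FROM   #People;

DROP TABLE #People;
```

Since the values are random, this variant does not preserve matches between tables; equal inputs stop being equal after the update.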


OR, if you want the obfuscation to be repeatable for the same input value (or at least don't mind it being repeatable) and prefer that this code be in a function so that it is easier to apply to multiple columns, you can use the HASHBYTES function. Like CRYPT_GEN_RANDOM, it returns binary bytes that can be converted to hex characters. Unlike CRYPT_GEN_RANDOM, the output length is fixed (in this case at 64 hex characters since I am using SHA2_256, which requires SQL Server 2012 or newer), so I used REPLICATE to repeat the hashed value if the input string is longer than 64 characters. Also unlike CRYPT_GEN_RANDOM, HASHBYTES can be used in a User-Defined Function (UDF) :-).

CREATE FUNCTION dbo.Obfuscate(@InputString NVARCHAR(4000))
RETURNS TABLE
WITH SCHEMABINDING
AS RETURN
  SELECT SUBSTRING(REPLICATE(CONVERT(VARCHAR(8000),
                                     HASHBYTES('SHA2_256', @InputString),
                                     2),
                             (LEN(@InputString) / 64) + 1),
                   1,
                   LEN(@InputString)) AS [Obfuscated];
GO

And that can be used as follows:

SELECT tmp.[String],
       LEN(tmp.[String]) AS [InputLength],
       ob.[Obfuscated],
       LEN(ob.[Obfuscated]) AS [OutputLength]
FROM   (VALUES (N'test'), (N'Hello'), (REPLICATE(N'A', 63)),
               (REPLICATE(N'B', 64)), (REPLICATE(N'C', 65)),
               (REPLICATE(N'D', 4000))) tmp(String)
CROSS APPLY dbo.Obfuscate(tmp.[String]) ob;

Returns something along the lines of:

String                       InputLength    Obfuscated                    OutputLength
------                       -----------    ----------                    ------------
test                               4        FE52                                 4
Hello                              5        A07E4                                5
AAAAAAAAAAAAAAAAAAAAAA...         63        4B589C85DE74E76487730F3...          63
BBBBBBBBBBBBBBBBBBBBBB...         64        79813FB6480F354F1C6017A...          64
CCCCCCCCCCCCCCCCCCCCCC...         65        FB4B38FBA41ECC24B5B0F68...          65
DDDDDDDDDDDDDDDDDDDDDD...       4000        5D01CC6508C164E652B5C77...        4000
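
Because the hash is deterministic, the same input obfuscates to the same value wherever it appears, which is what keeps data matchable across tables. A minimal sketch of that idea, assuming dbo.Obfuscate has been created as above; the temp tables #Customers and #Orders are hypothetical:

```sql
-- Hypothetical tables, used only for this illustration.
CREATE TABLE #Customers (LastName NVARCHAR(50) NOT NULL);
CREATE TABLE #Orders    (CustomerLastName NVARCHAR(50) NOT NULL);

INSERT INTO #Customers (LastName)         VALUES (N'Smith'), (N'Jones');
INSERT INTO #Orders    (CustomerLastName) VALUES (N'Smith'), (N'Smith'), (N'Jones');

-- Obfuscate both columns with the same function.
UPDATE c
SET    c.LastName = ob.[Obfuscated]
FROM   #Customers c
CROSS APPLY dbo.Obfuscate(c.LastName) ob;

UPDATE o
SET    o.CustomerLastName = ob.[Obfuscated]
FROM   #Orders o
CROSS APPLY dbo.Obfuscate(o.CustomerLastName) ob;

-- Rows that joined before obfuscation still join afterwards.
SELECT c.LastName, COUNT(*) AS [OrderCount]
FROM   #Customers c
INNER JOIN #Orders o
        ON o.CustomerLastName = c.LastName
GROUP BY c.LastName;

DROP TABLE #Customers, #Orders;
```

One thing to keep in mind: LEN ignores trailing spaces, so an input with trailing spaces comes back shorter than its stored length. If that matters, the length calculation would need to be adjusted.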

PLEASE NOTE: If you need alpha characters beyond A - F and/or need to have distinct obfuscated values for distinct input values (i.e. reduce chances of collisions), then either method above can be adapted easily enough to do that.
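
As one possible adaptation for a wider character set, each hash byte can be mapped onto a position in a custom alphabet. This is a rough sketch rather than a tuned implementation; the variable names, the 36-character alphabet, and the use of sys.all_objects as a row source are all assumptions made for the illustration:

```sql
DECLARE @InputString NVARCHAR(4000) = N'StackOverflow';
DECLARE @Alphabet    VARCHAR(36)    = 'abcdefghijklmnopqrstuvwxyz0123456789';
DECLARE @Hash        VARBINARY(32)  = HASHBYTES('SHA2_256', @InputString);

;WITH nums AS
(
    -- One row per character position of the input. sys.all_objects is
    -- only a convenient row source; very long inputs need a bigger one.
    SELECT TOP (LEN(@InputString))
           ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS [Pos]
    FROM   sys.all_objects
)
SELECT (SELECT SUBSTRING(@Alphabet,
                         -- Take hash byte number Pos (wrapping after 32 bytes)
                         -- and reduce it to an index into the alphabet.
                         (CONVERT(INT, SUBSTRING(@Hash, ((nums.[Pos] - 1) % 32) + 1, 1))
                              % LEN(@Alphabet)) + 1,
                         1)
        FROM   nums
        ORDER BY nums.[Pos]
        FOR XML PATH('')) AS [Obfuscated];
```

The modulo step is slightly biased because 256 is not a multiple of 36, and the output repeats every 32 characters for longer inputs, so treat this as obfuscation for test data rather than as a cryptographic guarantee.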

Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange