Dynamic Data Masking Issue when Concatenating Fields

https://dba.stackexchange.com/questions/278213

09-03-2021

Question

You can reproduce the issue here:

CREATE TABLE [dbo].[EmployeeDataMasking](
    [RowId] [int] IDENTITY(1,1) NOT NULL,
    [EmployeeId] [int] NULL,
    [LastName] [varchar](50) MASKED WITH (FUNCTION = 'partial(2, "XXXX", 2)') NOT NULL,
    [FirstName] [varchar](50) MASKED WITH (FUNCTION = 'partial(2, "XXXX", 2)') NOT NULL,
 CONSTRAINT [PK_EmployeeDataMasking] PRIMARY KEY CLUSTERED 
(
    [RowId] ASC
)WITH (STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF) ON [PRIMARY],
) ON [PRIMARY]
GO

Insert Into dbo.EmployeeDataMasking (EmployeeId, LastName, FirstName)
VALUES( 1,'Smithsonian','Daniel'),( 2,'Templeton','Ronald')

Select  
    EmployeeId,
    LastName,
    FirstName,
    LastName + ', ' + FirstName
From dbo.EmployeeDataMasking

Notice the LastName and FirstName fields are partially masked (as expected). However, the combined name field contains the default mask. I don't know if this is considered a bug. However, I would think the combined field would retain the mask of the two fields it comprises. At least that's what I would prefer, since I don't know how to provide a mask for the combined field.

Solution

The documentation is shamefully silent about how a masked column behaves when it is part of an expression.

This is how a masked column is represented in the execution plan:

Scalar Operator(DataMask([LastName],0x05000000,(3),(2),'XXXX',(2)))

Here (3) denotes the masking function type and corresponds to partial masking.

And this is how the LastName + ', ' + FirstName expression is represented:

Scalar Operator(DataMask([LastName]+', '+[FirstName],0x05000000,(1),(0),(0),(0)))

As you can see, the behavior is "compute, then mask". The masking function type in this case is (1), which corresponds to default masking. The masking function is adjusted for the expression as a whole, and this modification is intentional.
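Note that masks are only applied for principals that lack the UNMASK permission, so running the reproduction script as the table owner or a sysadmin returns the real values. The following is a minimal sketch for observing the behavior described above. The user name MaskedReader is hypothetical, introduced here only for illustration.

```sql
-- Hypothetical low-privilege test user (the name is an assumption, not from the question).
IF DATABASE_PRINCIPAL_ID('MaskedReader') IS NULL
    CREATE USER MaskedReader WITHOUT LOGIN;
GRANT SELECT ON dbo.EmployeeDataMasking TO MaskedReader;
GO

-- Run the query in the security context of the user without UNMASK.
EXECUTE AS USER = 'MaskedReader';

SELECT
    EmployeeId,
    LastName,                                 -- plain column: the partial mask defined on it applies
    FirstName,                                -- plain column: the partial mask defined on it applies
    LastName + ', ' + FirstName AS FullName   -- expression: comes back with the default mask
FROM dbo.EmployeeDataMasking;

-- Switch back to the original security context.
REVERT;
```

Running the same SELECT before EXECUTE AS (as the owner) and again inside it makes the difference between the plain columns and the concatenated expression easy to compare.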

With the "compute, then mask" approach, this mask adjustment is apparently a necessary measure, because otherwise simple expressions involving LEFT, RIGHT or SUBSTRING could easily unmask the data. Determining whether an expression is "safe" would probably be too complex.
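To illustrate the concern: if the partial(2, "XXXX", 2) mask were applied to the result of SUBSTRING(LastName, 3, 50), the first two characters of that result would be characters 3 and 4 of the real value, and shifting the start position would reveal the rest. The sketch below probes this with a hypothetical test user named MaskedReader. Based on the plan shown earlier, each expression would be expected to come back with the default mask rather than a partial one, though this is an inference from the plan and not documented behavior.

```sql
-- Hypothetical low-privilege test user (the name is an assumption for illustration).
IF DATABASE_PRINCIPAL_ID('MaskedReader') IS NULL
    CREATE USER MaskedReader WITHOUT LOGIN;
GRANT SELECT ON dbo.EmployeeDataMasking TO MaskedReader;
GO

EXECUTE AS USER = 'MaskedReader';

SELECT
    LastName,                                      -- plain column: partial mask
    LEFT(LastName, 4)          AS LeftProbe,       -- expression over a masked column
    SUBSTRING(LastName, 3, 50) AS SubstringProbe,  -- would leak characters 3-4 under a partial mask
    RIGHT(LastName, 4)         AS RightProbe       -- expression over a masked column
FROM dbo.EmployeeDataMasking;

REVERT;
```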

I can only guess why the approach is "compute, then mask" rather than "mask, then compute", but I think the latter, if implemented, would have problems of its own. Some that I can think of are arithmetic errors for numeric types (for example, division by zero if something is divided by a masked column) and skewed logic (if a conditional expression depends on a masked column).
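The division-by-zero case can be sketched with a numeric column, whose default mask is 0. The table dbo.SalaryMaskingDemo, its columns, and the user MaskedReader are all hypothetical names made up for this example. Under "mask, then compute" the divisor would be the masked value 0. Under "compute, then mask" the division should use the stored value and only the result should be masked, so no error would be expected.

```sql
-- Hypothetical demo table (not part of the original question).
CREATE TABLE dbo.SalaryMaskingDemo (
    EmployeeId int NOT NULL,
    Salary     int MASKED WITH (FUNCTION = 'default()') NOT NULL
);

INSERT INTO dbo.SalaryMaskingDemo (EmployeeId, Salary)
VALUES (1, 50000), (2, 80000);

-- Hypothetical low-privilege test user (the name is an assumption for illustration).
IF DATABASE_PRINCIPAL_ID('MaskedReader') IS NULL
    CREATE USER MaskedReader WITHOUT LOGIN;
GRANT SELECT ON dbo.SalaryMaskingDemo TO MaskedReader;
GO

EXECUTE AS USER = 'MaskedReader';

SELECT
    EmployeeId,
    Salary,                      -- shown as 0 to this user (default mask for numeric types)
    1000000 / Salary AS Ratio    -- divides by the stored value, then the result is masked
FROM dbo.SalaryMaskingDemo;

REVERT;

-- Clean up the demo table.
DROP TABLE dbo.SalaryMaskingDemo;
```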

Licensed under: CC-BY-SA with attribution
