Dynamic Data Masking Issue when Concatenating Fields
-
09-03-2021 - |
Question
You can reproduce the issue here:
CREATE TABLE [dbo].[EmployeeDataMasking](
[RowId] [int] IDENTITY(1,1) NOT NULL,
[EmployeeId] [int] NULL,
[LastName] [varchar](50) MASKED WITH (FUNCTION = 'partial(2, "XXXX", 2)') NOT NULL,
[FirstName] [varchar](50) MASKED WITH (FUNCTION = 'partial(2, "XXXX", 2)') NOT NULL,
CONSTRAINT [PK_EmployeeDataMasking] PRIMARY KEY CLUSTERED
(
[RowId] ASC
)WITH (STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF) ON [PRIMARY],
) ON [PRIMARY]
GO
Insert Into dbo.EmployeeDataMasking (EmployeeId, LastName, FirstName)
VALUES( 1,'Smithsonian','Daniel'),( 2,'Templeton','Ronald')
Select
EmployeeId,
LastName,
FirstName,
LastName + ', ' + FirstName
From dbo.EmployeeDataMasking
Notice the LastName and FirstName fields are partially masked (as expected). However, the combined name field contains the default mask. I don't know if this is considered a bug. However, I would think the combined field would retain the mask of the two fields it comprises. At least that's what I would prefer, since I don't know how to provide a mask for the combined field.
Solution
Documentation is shamefully silent in regard to what is the behavior when masked column is the part of an expression.
This is how masked column is represented in execution plan:
Scalar Operator(DataMask([LastName],0x05000000,(3),(2),'XXXX',(2)))
here (3)
denotes masking function type and correspond to partial
masking.
And this is how LastName + ', ' + FirstName
expression is represented:
Scalar Operator(DataMask([LastName]+', '+[FirstName],0x05000000,(1),(0),(0),(0)))
As you may see, the behavior is "compute, then mask". Masking function type in this case is (1)
, which correspond to default
masking. This is the result of masking function adjustment and intentional modification.
With the "compute, then mask" approach, this (mask adjustment) is the necessary measure apparently, because of otherwise simple expressions involving LEFT
, RIGHT
or SUBSTRING
functions could unmask data easily. And determining whether expression is "safe" or not would be too complex probably.
I can only guess why approach is "compute, then mask" and not "mask, then compute", but I think that the latter, being implemented, would suffer from its own problems. Some, that I can think of, include possibility of arithmetic errors for numeric types (such as division by zero, for example, if something is divided to masked column), or logic skew possibility (if there is a conditional expression that depends on masked column).