Question

I am working on a VERY large database (10B+ rows) that performs matching on SSN and BirthDate to try to find duplicate records. The table uses columnstore compression (SQL Server 2016), and it occurred to me that I could save the SSN as a DECIMAL(10,9) to preserve the leading zeros without taking the performance hit of a CHAR/VARCHAR. I am wondering whether anyone has tried this, or whether there is a reason why it would not work as expected. I know I could cast to INT and just lose the leading zeros, but this seemed like a better solution to me.

ISNULL(TRY_CAST('.' + SSN AS DECIMAL(10,9)),0) AS DecimalSSN

I can always cast it back to a string with RIGHT(TRY_CAST(DecimalSSN AS VARCHAR),9) AS SSN


Solution

I wouldn't use a DECIMAL to store SSNs with rowstore or columnstore tables. The INT data type has the following advantages over DECIMAL:

  • It's generally faster for SQL Server to work with
  • It allows bitmap filters to be pushed down to the storage engine
  • If the column doesn't allow NULLs then it allows for a "perfect hashing function" which doesn't require a probe residual in joins.

If you need better performance for your SSN column, I would use an INT with a leading 1, so that the SSN 012345678 is stored as 1012345678. That preserves the leading zeroes, which seems to be what you want. You should store all SSNs in the same format and only cast when necessary. For example, if you need to display an SSN as a string to an end user, then SELECT RIGHT(CAST(1012345678 AS INT), 9) returns "012345678". Otherwise, work with the raw value.
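As a minimal sketch of the round trip described above, the following converts a nine-character SSN string to the leading-1 INT form and back. The variable names and the sample value are illustrative assumptions, not part of the original schema. The largest encoded value, 1999999999, is below the INT maximum of 2147483647, so every nine-digit SSN fits.

```sql
-- Hypothetical sample value; variable names are for illustration only.
DECLARE @SSN CHAR(9) = '012345678';

-- Encode: prefix a '1' so the leading zero survives the conversion to INT.
-- TRY_CAST returns NULL if the string is not a valid integer.
DECLARE @IntSSN INT = TRY_CAST('1' + @SSN AS INT);

-- Decode: convert back to a string and keep the last 9 characters.
SELECT @IntSSN AS IntSSN,
       RIGHT(CAST(@IntSSN AS VARCHAR(10)), 9) AS SSN;
```

Non-numeric input produces NULL rather than an error, so decide up front how such rows should be handled (for example, with ISNULL as in the question).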

I have no idea what your queries look like, but suppose you have 100k SSNs in a table and you need to check if any of those SSNs appear in a different table that has a billion rows. Here's what the query might look like:

SELECT *
FROM dbo.SSNS_TO_CHECK_3 c
WHERE NOT EXISTS (
    SELECT 1
    FROM dbo.ALL_SSNS_CCI_INT_LEADING_1 t
    WHERE c.SSN = t.SSN
)
OPTION (MAXDOP 1);

Using an INT column with a leading 1, the above query takes 7 seconds on my machine. With the DECIMAL(10, 9) format that you proposed, the query takes 63 seconds on my machine. Nearly all of the time is spent on the bitmap operator.
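The table definitions are not shown here. As a rough sketch for anyone who wants to try a similar comparison, the two tables could be declared along these lines. The column lists, the heap for the small table, and the index name are assumptions; only the table names come from the query above. The SSN columns are declared NOT NULL to match the non-nullable case mentioned in the list of INT advantages.

```sql
-- Assumed shape of the small table of SSNs to look up (stored as a heap here).
CREATE TABLE dbo.SSNS_TO_CHECK_3 (
    SSN INT NOT NULL  -- leading-1 format, e.g. 1012345678
);

-- Assumed shape of the large table, stored as a clustered columnstore index.
CREATE TABLE dbo.ALL_SSNS_CCI_INT_LEADING_1 (
    SSN INT NOT NULL, -- leading-1 format, e.g. 1012345678
    INDEX CCI_ALL_SSNS CLUSTERED COLUMNSTORE
);
```

For the DECIMAL variant, the same sketch would apply with the SSN columns declared as DECIMAL(10, 9) NOT NULL in both tables.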

There are some additional considerations with columnstore but they don't matter here. INT is superior in every way that I know of compared to DECIMAL(10, 9).

Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange