Question

I'm designing a table that will contain a lot of rows, so I need to be careful not to store too much information. One of the columns is an NVARCHAR(MAX) column containing our customers' addresses. As addresses do not change often, this column will contain many repeated values and thus quite a bit of redundancy.

So I was wondering whether I need to normalize this myself by maintaining some sort of look-up table of address strings (note that if an address changes I need to keep the history, so it's not a matter of the usual normalization), or whether SQL Server points to the same reference of the string behind the scenes. Or maybe it offers a column option to do so. Another approach that came to mind is to use COMPRESS, but I guess that does not make sense, as the data itself (i.e. the address) is not long.

Read/write performance is not much of a concern, as the data will be accumulated over time.


Solution

Yes, duplicated data is stored as separate copies in SQL Server.

To change this behavior, you would need to enable the page compression feature: create or rebuild the indexes with the (DATA_COMPRESSION = PAGE) option.

It is a great feature that helps to save space. Once you enable it, SQL Server effectively points to the same reference of the string behind the scenes: within a page, repeated values are stored once through prefix and dictionary compression.
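
For example, a minimal sketch of enabling it (the table and index names here are hypothetical):

    -- Rebuild the table's clustered index (or heap) with page compression:
    ALTER TABLE dbo.CustomerOrders
        REBUILD WITH (DATA_COMPRESSION = PAGE);

    -- Or target a specific index:
    ALTER INDEX IX_CustomerOrders_Address ON dbo.CustomerOrders
        REBUILD WITH (DATA_COMPRESSION = PAGE);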

Beware that page compression is not available in every SQL Server edition, and it can add some CPU overhead.

So you might want to create a lookup table if your edition does not allow for page compression.

OTHER TIPS

Under "normal" conditions, no, data in VARCHAR and NVARCHAR columns is not de-duped (although duplicate attribute and/or element names in a single XML value are reduced to a unique instance).

Using one of the Data Compression options is probably your best bet. Here are some things to consider:

  1. Unicode Compression (part of Row Compression) only works on NVARCHAR(1 - 4000), not NVARCHAR(MAX) (please vote for / support: Unicode compression NVARCHAR(MAX)).
  2. Page Compression can work with NVARCHAR(MAX), but only for in-row data. Off-row data (LOB pages) is not compressed.
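
If you want to estimate the savings before committing to either option, sp_estimate_data_compression_savings can help (the table name below is just a placeholder):

    -- Estimate how much space PAGE compression would save for a given table:
    EXEC sys.sp_estimate_data_compression_savings
        @schema_name      = N'dbo',
        @object_name      = N'CustomerOrders',
        @index_id         = NULL,   -- all indexes
        @partition_number = NULL,   -- all partitions
        @data_compression = N'PAGE';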

Since the data won't really be changing, you should look into the Columnstore Index options (also available in Azure SQL Database):

Columnstore compression should be better than Page compression.
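
A minimal sketch of what that could look like (the table name is hypothetical; note that, as far as I know, NVARCHAR(MAX) columns are supported in clustered columnstore indexes only from SQL Server 2017 onward):

    -- Convert the table's storage to a clustered columnstore index:
    CREATE CLUSTERED COLUMNSTORE INDEX CCI_CustomerOrders
        ON dbo.CustomerOrders;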

 

Also, you should probably avoid using the SPARSE option, due to:

  1. The SPARSE option offers no benefit compared to Data Compression.
  2. It is mainly intended for column sets / wide tables (i.e. up to 30,000 columns).
  3. It mostly helps with fixed-length datatypes (e.g. INT, DATETIME). So, prior to Data Compression being available, it was useful for CHAR and NCHAR, but not for VARCHAR or NVARCHAR as they don't take up space when NULL.
  4. It only benefits columns set to NULL.
  5. It slightly hurts non-NULL values by adding 2 bytes to each one.
  6. You should probably have NULL for 50% - 60% of the rows in order to get enough savings for it to be worth using this option.

For more details on working with character data, please see the following post of mine:

How Many Bytes Per Character in SQL Server: a Completely Complete Guide

Compression is fine and probably useful in your case, but you should normalize your tables anyway and aim for a structure where you store only one copy of the same address. This will reduce redundancy and lighten the primary table you're currently asking about.
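
As a rough sketch of such a structure (all names hypothetical), each distinct address is stored exactly once and referenced by key:

    -- Look-up table holding each distinct address once:
    CREATE TABLE dbo.Address
    (
        AddressID   int IDENTITY(1,1) NOT NULL PRIMARY KEY,
        AddressText nvarchar(500)     NOT NULL
    );

    -- The large table references the address by its surrogate key:
    CREATE TABLE dbo.CustomerOrder
    (
        OrderID   int NOT NULL PRIMARY KEY,
        AddressID int NOT NULL REFERENCES dbo.Address (AddressID)
    );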

Two more things to consider are:

  1. NVARCHAR can use twice as much space as a VARCHAR of the same length (Solomon Rutzky's answer links a good article regarding this), so it can be more data-heavy. If you can use VARCHAR instead, you can probably also drop the length down to something much more reasonable for an address field and save a lot more space (see the quick DATALENGTH comparison after this list). Here's an article comparing the two: SQL varchar data type deep dive

  2. You should also look into sparse columns which can save you a significant amount of space too when used correctly. Here's the Microsoft docs: Use Sparse Columns

    (Important to note that a minimum number of NULL values needs to be present for sparse columns to be worth looking into.)
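
A quick way to see the size difference for yourself (assuming a non-UTF-8 collation and no data compression; DATALENGTH returns bytes):

    -- NVARCHAR stores 2 bytes per character, VARCHAR 1 byte:
    SELECT
        DATALENGTH(CAST(N'221B Baker Street' AS nvarchar(max))) AS nvarchar_bytes, -- 34
        DATALENGTH(CAST('221B Baker Street'  AS varchar(max)))  AS varchar_bytes;  -- 17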

Just a note on page compression: this type of compression is limited to what's on a page. I.e., you won't see any reduction of duplicate values that end up on different pages.

Also, page compression was made available in lower editions as of SQL Server 2016 SP1; and from what I can see, this also applies to Azure SQL Database according to this.

I'm not sure of your design, but my recommendation is to store the address in a separate table from the customer table. The association between the two tables will depend on the business logic (e.g. whether an address is associated with one or many customers). In my experience, putting an entire address in one column makes things more difficult; for example, finding all customers in a state or country involves a substring search.

The other advantages are that maintaining address history is much simpler (you can have a revision number in the address table or link table, along with a validity date range), and you can manage security more efficiently.
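
One possible shape for that association (all names hypothetical), using a link table that carries a revision number and a validity date range:

    -- Link table between customers and addresses, keeping full history:
    CREATE TABLE dbo.CustomerAddress
    (
        CustomerID int  NOT NULL,
        AddressID  int  NOT NULL,
        Revision   int  NOT NULL,
        ValidFrom  date NOT NULL,
        ValidTo    date NULL,   -- NULL means this is the current address
        CONSTRAINT PK_CustomerAddress PRIMARY KEY (CustomerID, Revision)
    );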

From a performance point of view, large columns make for bloated rows. Bloated rows can reduce performance; for example, fewer rows can be transferred per I/O operation. The definition of bloated varies with which DBMS you use, how it is configured, the OS, etc. I try to keep my table rows under 8 KB in size, which is well within the performance envelope.
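
If you want to see where your rows actually sit relative to such a threshold, sys.dm_db_index_physical_stats reports record sizes (the table name below is hypothetical):

    -- Average and maximum record size, in bytes, per index of a table:
    SELECT index_id, avg_record_size_in_bytes, max_record_size_in_bytes
    FROM sys.dm_db_index_physical_stats(
             DB_ID(), OBJECT_ID(N'dbo.CustomerOrders'), NULL, NULL, 'DETAILED');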

Aleksey has answered your question very well; however, I'm not entirely sure what your end goal is.

If you have a very limited amount of space to store this database, then compression will help; but compressing strings (such as an address) will net you a very small amount of space saved in modern terms.

You've alluded to there being repeated data for your customers. If I were to guess, I'd say you're creating some sort of transaction log for purchases/transactions? In that case, you would likely omit the customer information from this table entirely and simply use a reference to your customers table on each row, with a possible "override" column indicating that something was non-standard about an order/transaction (like the customer using a different address from their default).

I appreciate you may have security concerns; however, there may be better, more specific answers to your problem if you can provide more information on what you're trying to achieve. As it stands, Aleksey has described the feature you asked about very well; whether that's the best feature for you isn't guaranteed.
