Question

I have a "comment"-type column that is rarely used -- around 6% non-null in a population of 3 million records. The average length (when used) is 6 characters, and the max so far is around 3KB. A maximum of 4000 characters is reasonable for this field. I have two options:

comments varchar(max) NULL -- this is the current column definition
comments varchar(4000) SPARSE NULL

My current understanding is that in both cases, a NULL value would require no storage -- just the column's NULL bit set and a length of 0 in the row metadata.

But for the non-null cases, does one have a clear advantage over the other?

The extra 4-byte pointer for sparse columns with values suggests they are always stored off-row like text or very large varchar(max) fields. Is that the case?

If so, I'd lean toward using varchar(max), since it only stores values off-row if the total row length exceeds 8KB, and the majority of my values are short and unlikely to put a row over the limit.

I haven't seen this particular situation addressed in the BOL, so I'm hoping someone here knows enough about the innards of MSSQL to give some insight.

(If it matters, I'm currently using 2008R2, but hoping to upgrade soon to 2014.)

Solution

There is no advantage for the non-NULL cases when using SPARSE, and in fact, there are two stated disadvantages:

  • an extra 4 bytes for each non-NULL value
  • slightly longer access time

As you pretty much already gathered, the SPARSE option only makes sense for fixed-length datatypes; I can't think of a single reason to use it on variable-length types.
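If you would rather measure than take this on faith, a scratch comparison is cheap. The following is only a sketch: the two table names are made up for the test, the 100,000-row sample and the 'abcdef' filler value are assumptions chosen to mimic your roughly 6% fill rate and 6-character average, and your real table's other columns will shift the numbers.

```sql
-- Hypothetical scratch tables; run in a test database, not production.
CREATE TABLE dbo.CommentTest_Plain
    (id int IDENTITY PRIMARY KEY, comments varchar(4000) NULL);
CREATE TABLE dbo.CommentTest_Sparse
    (id int IDENTITY PRIMARY KEY, comments varchar(4000) SPARSE NULL);

-- Load 100,000 rows, about 6% of them with a 6-character comment.
INSERT INTO dbo.CommentTest_Plain (comments)
SELECT TOP (100000)
       CASE WHEN ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) % 100 < 6
            THEN 'abcdef'
       END
FROM sys.all_objects AS a
CROSS JOIN sys.all_objects AS b;

-- Same data into the SPARSE variant.
INSERT INTO dbo.CommentTest_Sparse (comments)
SELECT comments
FROM dbo.CommentTest_Plain
ORDER BY id;

-- Compare reserved/data sizes for the two definitions.
EXEC sp_spaceused N'dbo.CommentTest_Plain';
EXEC sp_spaceused N'dbo.CommentTest_Sparse';

-- Clean up the scratch tables.
DROP TABLE dbo.CommentTest_Plain;
DROP TABLE dbo.CommentTest_Sparse;
```

Whatever the two sp_spaceused results show on your build is a better guide than a rule of thumb, since the outcome depends on where the column sits among the other variable-length columns in the row.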

I am not sure that the extra 4 bytes implies anything about the value being stored off-row. Also, the MAX types aren't entirely off-row even when they exceed 8000 bytes: the row then holds a 16-byte pointer to the off-row location.
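You can check where the data actually lives instead of guessing. This is a rough sketch that assumes your table is called dbo.Records (a placeholder, not something from your post); it lists the allocation units the table uses and shows whether the table has been switched to push large value types out of row.

```sql
-- dbo.Records is a hypothetical table name; substitute your own.
-- 'DETAILED' reads every page, so run this off-hours on a big table.
SELECT ips.index_id,
       ips.alloc_unit_type_desc,   -- IN_ROW_DATA, ROW_OVERFLOW_DATA, or LOB_DATA
       ips.page_count,
       ips.record_count
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID(N'dbo.Records'), NULL, NULL, 'DETAILED') AS ips
WHERE ips.index_level = 0
ORDER BY ips.index_id, ips.alloc_unit_type_desc;

-- 1 means varchar(max) values are always stored out of row for this table.
SELECT name, large_value_types_out_of_row
FROM sys.tables
WHERE object_id = OBJECT_ID(N'dbo.Records');
```

If nearly all pages show up under IN_ROW_DATA, your short comments are sitting in the row, as you would expect for values averaging 6 characters.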

Stick with VARCHAR(4000), no SPARSE, and I would even consider making it NOT NULL DEFAULT('') (an empty string is still 0 bytes, but now you don't need to mess with the NULL indicator, and can a comment really be "unknown" as opposed to "no comment"?).
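For what it's worth, the change could look roughly like the following. Treat it as a sketch under assumptions: dbo.Records and DF_Records_comments are names I made up, and on 3 million rows you would want to run this in a maintenance window (and possibly batch the UPDATE), since the ALTER COLUMN has to touch the data.

```sql
-- dbo.Records and DF_Records_comments are hypothetical names.

-- 1. Confirm nothing exceeds the new limit (expect a value <= 4000).
SELECT MAX(DATALENGTH(comments)) AS max_bytes
FROM dbo.Records;

-- 2. Replace NULLs with empty strings so NOT NULL can be applied.
UPDATE dbo.Records
SET comments = ''
WHERE comments IS NULL;

-- 3. Shrink the type and disallow NULLs.
ALTER TABLE dbo.Records
    ALTER COLUMN comments varchar(4000) NOT NULL;

-- 4. Default new rows to "no comment".
ALTER TABLE dbo.Records
    ADD CONSTRAINT DF_Records_comments DEFAULT ('') FOR comments;
```

If you decide to keep the column nullable, skip steps 2 and 4 and use NULL instead of NOT NULL in step 3.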

OTHER TIPS

I second Srutzky. Agreed.

Now, let me add a bit of operational perspective that has a lot to do with your decision. Since you are currently on varchar(max), apparently it's not causing you any issues, but getting away from it has certain advantages in performance and operational capabilities.

Just to give you one example, there is a useful feature called Online Index Rebuild, which is available only in Enterprise Edition.

Allow me to sidetrack a little. After a long period of use, indexes become fragmented and need to be rebuilt. However, a regular (offline) rebuild takes heavy locks on the underlying table, and the index is not usable while it is being rebuilt, which leaves queries dead in the water on very large databases. It's not just "hmm..it's kinda slow", it is a "2-second query takes 25 minutes!" kind of emergency. So, in a 24/7 system it's not an option. That's where online index rebuild comes into play: if you paid $25,000 or so for a core license for the privilege of using the Enterprise Edition, you can magically rebuild an index on a 24/7 system without impacting users.

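Except, if some developer threw in varchar(max), it won't work on your current 2008 R2: there, an index that contains a varchar(max) column (which includes the clustered index) cannot be rebuilt online. It happily would, however, on varchar(4000). SQL Server 2012 and later lift this restriction for varchar(max), so your planned upgrade to 2014 changes the picture. If the data contained over 8000 characters, you would be stuck with varchar(max) and, on 2008 R2, unable to perform an online rebuild, which would be an operational issue the higher-ups would surely notice.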
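To see whether this affects you, you can simply try it on a test copy. A minimal sketch, assuming Enterprise Edition and made-up names (dbo.Records for the table, PK_Records for its clustered index):

```sql
-- dbo.Records and PK_Records are hypothetical names.

-- How fragmented are the table's indexes right now?
SELECT i.name AS index_name,
       ips.avg_fragmentation_in_percent,
       ips.page_count
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID(N'dbo.Records'), NULL, NULL, 'LIMITED') AS ips
JOIN sys.indexes AS i
  ON i.object_id = ips.object_id
 AND i.index_id  = ips.index_id;

-- Attempt the rebuild while keeping the table available to users.
ALTER INDEX PK_Records ON dbo.Records
    REBUILD WITH (ONLINE = ON);
```

If the online option is not allowed for that index on your version or edition, the ALTER INDEX should fail with an error rather than quietly falling back to an offline rebuild, so a failure here tells you what you need to know.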

..and that is just one example. So my recommendation is to talk to the production DBA in your organization and ask them what they like and don't like. Since you are currently running varchar(max), I take it that it's not an issue, but you can future-proof the table by removing it. That said, you would be perfectly fine with varchar(max) if the table is for infrequently accessed storage with no need for online index rebuild. This is the sort of call only your production DBA can make.

If you are in a smaller shop with no dedicated DBA, provide more detail about the use of the table and the operational requirements (Is it 24/7 with a five-nines requirement? Is it clustered? How long is your service window? What is the current edition, and are there plans to change it?) and the community can give you better recommendations. I may be asking for too much info, but it's the sort of detail experienced DBAs count on to make the right call.

Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange