Question

I've been having an ongoing debate with various developers in my office about the cost of an index, and whether uniqueness is beneficial or costly (probably both). The crux of the issue is that we have competing resource demands.

Background

I have previously read a discussion stating that a Unique index costs nothing extra to maintain: an Insert operation must seek to where the row fits in the B-tree anyway, and if a duplicate key is found in a non-unique index, a uniquifier is appended to the end of the key; otherwise the row is inserted directly. In that sequence of events, a Unique index adds no cost.

My coworker counters this by saying that uniqueness is enforced as a second operation after the seek to the new position in the B-tree, and thus a unique index is more costly to maintain than a non-unique one.

At one extreme, I have seen tables whose clustering key is an identity column (inherently unique) yet is explicitly declared non-unique. At the other extreme is my obsession with uniqueness: I create every index as unique, and when an index has no naturally unique key, I append the PK of the table to the end of the index to guarantee uniqueness.

I'm frequently involved in code reviews for the dev team, and I need to be able to give general guidelines for them to follow. Yes, every index should be evaluated, but when you have five servers with thousands of tables each and as many as twenty indexes on a table, you need to be able to apply some simple rules to ensure a certain level of quality.

Question

Does uniqueness have an additional cost on the back-end of an Insert compared to the cost of maintaining a non-unique index? Secondly, what is wrong with appending the Primary Key of a table to the end of an index to ensure uniqueness?

Example Table Definition

create table #test_index
    (
    id int not null identity(1, 1),
    dt datetime not null default(current_timestamp),
    val varchar(100) not null,
    is_deleted bit not null default(0),
    primary key nonclustered(id desc),
    unique clustered(dt desc, id desc)
    );

create index
    [nonunique_nonclustered_example]
on #test_index
    (is_deleted)
include
    (val);

create unique index
    [unique_nonclustered_example]
on #test_index
    (is_deleted, dt desc, id desc)
include
    (val);

Example

An example of why I would add the Unique key to the end of an index is in one of our fact tables. There is a Primary Key that is an Identity column. However, the Clustered Index is instead the partitioning scheme column, followed by three foreign key dimensions with no uniqueness. Select performance on this table is abysmal, and I frequently get better seek times using the Primary Key with a key lookup rather than leveraging the Clustered Index. Other tables that follow a similar design but have the Primary Key appended to the end perform considerably better.

-- date_int is equivalent to convert(int, convert(varchar, current_timestamp, 112))
if not exists(select * from sys.partition_functions where [name] = N'pf_date_int')
    create partition function 
        pf_date_int (int) 
    as range right for values 
        (19000101, 20180101, 20180401, 20180701, 20181001, 20190101, 20190401, 20190701);
go

if not exists(select * from sys.partition_schemes where [name] = N'ps_date_int')
    create partition scheme 
        ps_date_int
    as partition 
        pf_date_int all 
    to 
        ([PRIMARY]);
go

if not exists(select * from sys.objects where [object_id] = OBJECT_ID(N'dbo.bad_fact_table'))
    create table dbo.bad_fact_table
        (
        id int not null, -- Identity implemented elsewhere, and CDC populates
        date_int int not null,
        dt date not null,
        group_id int not null,
        group_entity_id int not null, -- member of group
        fk_id int not null,
        -- tons of other columns
        primary key nonclustered(id, date_int),
        index [ci_bad_fact_table] clustered (date_int, group_id, group_entity_id, fk_id)
        )
    on ps_date_int(date_int);
go

if not exists(select * from sys.objects where [object_id] = OBJECT_ID(N'dbo.better_fact_table'))
    create table dbo.better_fact_table
        (
        id int not null, -- Identity implemented elsewhere, and CDC populates
        date_int int not null,
        dt date not null,
        group_id int not null,
        group_entity_id int not null, -- member of group
        -- tons of other columns
        primary key nonclustered(id, date_int),
        index [ci_better_fact_table] clustered(date_int, group_id, group_entity_id, id)
        )
    on ps_date_int(date_int);
go

Solution

I'm frequently involved in code reviews for the dev team, and I need to be able to give general guidelines for them to follow.

The environment I'm currently involved in has 250 servers with 2500 databases. I've worked on systems with 30,000 databases. Guidelines for indexing should cover things like naming conventions, not be "rules" for which columns to include in an index; every individual index should be engineered to be the correct index for the specific business rule or code touching the table.

Does uniqueness have an additional cost on the back-end of an Insert compared to the cost of maintaining a non-unique index? Secondly, what is wrong with appending the Primary Key of a table to the end of an index to ensure uniqueness?

Adding the primary key column to the end of a non-unique index to make it unique looks to me to be an anti-pattern. If business rules dictate the data should be unique, then add a unique constraint to the column, which will automatically create a unique index. If you're indexing a column for performance, why would you add a column to the index?
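As an aside, here's a minimal sketch of the constraint route (object names are illustrative, not from the question): the unique constraint creates its backing unique index automatically, so nothing extra needs to be declared.

DROP TABLE IF EXISTS dbo.ConstraintDemo;
CREATE TABLE dbo.ConstraintDemo
(
    id int NOT NULL IDENTITY(1,1) PRIMARY KEY,
    email varchar(100) NOT NULL
);

-- The unique constraint automatically creates a backing unique index.
ALTER TABLE dbo.ConstraintDemo
    ADD CONSTRAINT UQ_ConstraintDemo_email UNIQUE (email);

-- The backing index shows up in the catalog, flagged as a unique constraint.
SELECT [name], is_unique, is_unique_constraint
FROM sys.indexes
WHERE [object_id] = OBJECT_ID(N'dbo.ConstraintDemo');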

Even if your supposition that enforcing uniqueness adds no extra overhead were correct (and it isn't in certain cases), what are you solving by needlessly complicating the index?

In the specific instance of adding the primary key to the end of your index key so that you can make the index definition include the UNIQUE modifier, it actually makes zero difference to the physical index structure on disk. This is due to the nature of B-tree index keys: they always need to be unique.

As David Browne mentioned in a comment:

Since every nonclustered index is stored as a unique index, there is no extra cost in inserting into a unique index. In fact the only extra cost would be in failing to declare a candidate key as a unique index, which would cause the clustered index keys to be appended to the index keys.

Take the following minimally complete and verifiable example:

USE tempdb;

DROP TABLE IF EXISTS dbo.IndexTest;
CREATE TABLE dbo.IndexTest
(
    id int NOT NULL
        CONSTRAINT IndexTest_pk
        PRIMARY KEY
        CLUSTERED
        IDENTITY(1,1)
    , rowDate datetime NOT NULL
);

I'll add two indexes that are identical except for the addition of the primary key at the tail end of the second index's key definition:

CREATE INDEX IndexTest_rowDate_ix01
ON dbo.IndexTest(rowDate);

CREATE UNIQUE INDEX IndexTest_rowDate_ix02
ON dbo.IndexTest(rowDate, id);

Next, we'll add several rows to the table:

INSERT INTO dbo.IndexTest (rowDate)
VALUES (DATEADD(SECOND, 0, GETDATE()))
     , (DATEADD(SECOND, 0, GETDATE()))
     , (DATEADD(SECOND, 0, GETDATE()))
     , (DATEADD(SECOND, 1, GETDATE()))
     , (DATEADD(SECOND, 2, GETDATE()));

As you can see above, three rows contain the same value for the rowDate column, and two rows contain unique values.
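If you want to double-check that distribution before digging into pages, a trivial aggregate confirms it:

-- Expect one rowDate with a count of 3, and two with a count of 1.
SELECT rowDate, COUNT(*) AS row_count
FROM dbo.IndexTest
GROUP BY rowDate
ORDER BY rowDate;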

Next, we'll look at the physical page structures for each index, using the undocumented DBCC PAGE command:

DECLARE @dbid int = DB_ID();
DECLARE @fileid int;
DECLARE @pageid int;
DECLARE @indexid int;

SELECT @fileid = ddpa.allocated_page_file_id
    , @pageid = ddpa.allocated_page_page_id
FROM sys.indexes i 
CROSS APPLY sys.dm_db_database_page_allocations(DB_ID(), i.object_id, i.index_id, NULL, 'LIMITED') ddpa
WHERE i.name = N'IndexTest_rowDate_ix01'
    AND ddpa.is_allocated = 1
    AND ddpa.is_iam_page = 0;

PRINT N'*************************************** IndexTest_rowDate_ix01 *****************************************';
DBCC TRACEON(3604);
DBCC PAGE (@dbid, @fileid, @pageid, 1);
DBCC TRACEOFF(3604);
PRINT N'*************************************** IndexTest_rowDate_ix01 *****************************************';

SELECT @fileid = ddpa.allocated_page_file_id
    , @pageid = ddpa.allocated_page_page_id
FROM sys.indexes i 
CROSS APPLY sys.dm_db_database_page_allocations(DB_ID(), i.object_id, i.index_id, NULL, 'LIMITED') ddpa
WHERE i.name = N'IndexTest_rowDate_ix02'
    AND ddpa.is_allocated = 1
    AND ddpa.is_iam_page = 0;

PRINT N'*************************************** IndexTest_rowDate_ix02 *****************************************';
DBCC TRACEON(3604);
DBCC PAGE (@dbid, @fileid, @pageid, 1);
DBCC TRACEOFF(3604);
PRINT N'*************************************** IndexTest_rowDate_ix02 *****************************************';

I've compared the output using Beyond Compare, and except for obvious differences around the allocation page IDs, etc., the two index structures are identical.

[Screenshot: Beyond Compare diff of the two DBCC PAGE outputs, showing identical index structures]

You might take the above to mean that including the primary key in every index, and defining it as unique, is A Good Thing™ since that's what happens under-the-covers anyway. I wouldn't make that assumption, and would suggest only defining an index as unique if in fact the natural data in the index is unique already.
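To illustrate the downside of guessing wrong, here's a minimal sketch (illustrative object names) of what happens when an index is declared unique over data that isn't naturally unique: the duplicate insert fails outright instead of being absorbed.

DROP TABLE IF EXISTS dbo.UniqueGoneWrong;
CREATE TABLE dbo.UniqueGoneWrong (val int NOT NULL);

CREATE UNIQUE INDEX UniqueGoneWrong_val_ix
ON dbo.UniqueGoneWrong (val);

INSERT INTO dbo.UniqueGoneWrong (val) VALUES (1); -- succeeds
INSERT INTO dbo.UniqueGoneWrong (val) VALUES (1); -- fails with error 2601:
-- Cannot insert duplicate key row in object 'dbo.UniqueGoneWrong'
-- with unique index 'UniqueGoneWrong_val_ix'.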

There are several excellent resources about this topic online.

FYI, the mere presence of an identity column does not guarantee uniqueness. You need to define the column as a primary key or with a unique constraint to ensure the values stored in that column are in fact unique. The SET IDENTITY_INSERT schema.table ON; statement will allow you to insert non-unique values into a column defined as identity.
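A quick sketch demonstrating that point (the table name is illustrative):

DROP TABLE IF EXISTS dbo.IdentityDemo;
CREATE TABLE dbo.IdentityDemo (id int NOT NULL IDENTITY(1,1));

INSERT INTO dbo.IdentityDemo DEFAULT VALUES; -- id = 1

-- With IDENTITY_INSERT on, an explicit duplicate value is accepted,
-- because identity alone enforces nothing.
SET IDENTITY_INSERT dbo.IdentityDemo ON;
INSERT INTO dbo.IdentityDemo (id) VALUES (1);
SET IDENTITY_INSERT dbo.IdentityDemo OFF;

SELECT id FROM dbo.IdentityDemo; -- returns 1 twice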

OTHER TIPS

Just an add-on to Max's excellent answer.

When it comes to creating a non-unique clustered index, SQL Server creates something called a uniquifier in the background anyway.

This uniquifier can cause problems down the road if your platform does a lot of CRUD operations, because it is only 4 bytes (a 32-bit integer). With enough inserts into the same key value, you can exhaust the available uniquifier values, at which point SQL Server throws an error and refuses further inserts for that key (it no longer has a unique value to assign to the newly inserted rows).

When this happens, you will receive this error:

The maximum system-generated unique value for a duplicate group 
was exceeded for index with partition ID (someID). 

Dropping and re-creating the index may resolve this;
otherwise, use another clustering key.

Error 666 (the above error) occurs when the uniquifier for a single set of non-unique keys consumes more than 2,147,483,647 rows.

So, you'll need to have either ~2 billion rows for a single key value, or you'll need to have modified a single key value ~2 billion times to see this error. As such, it's not extremely likely you'll run into this limitation.
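If you're curious whether uniquifiers are present at all, one hedged way to spot them (assuming a non-unique clustered index over fixed-length columns only) is to compare record sizes at the leaf level, since only the duplicated rows carry the uniquifier:

DROP TABLE IF EXISTS dbo.UniquifierDemo;
CREATE TABLE dbo.UniquifierDemo (val int NOT NULL);
CREATE CLUSTERED INDEX UniquifierDemo_ci ON dbo.UniquifierDemo (val);

INSERT INTO dbo.UniquifierDemo (val) VALUES (1), (1), (2);

-- Duplicate-key rows carry the uniquifier (plus variable-length record
-- overhead), so max_record_size_in_bytes should exceed the minimum.
SELECT index_level, min_record_size_in_bytes, max_record_size_in_bytes
FROM sys.dm_db_index_physical_stats(
    DB_ID(), OBJECT_ID(N'dbo.UniquifierDemo'), NULL, NULL, 'DETAILED');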

I'm not going to weigh in on the question of whether an index should be unique or not, and whether there's more overhead in this approach or that. But a couple of things bothered me in your general design:

  1. dt datetime not null default(current_timestamp). datetime is the older type here, and you may be able to achieve at least some space savings by using datetime2() and sysdatetime().
  2. create index [nonunique_nonclustered_example] on #test_index (is_deleted) include (val). This bothers me. Take a look at how the data is actually accessed (I'm betting there's more than WHERE is_deleted = 0) and look at using a filtered index. I would even consider using two filtered indexes, one for WHERE is_deleted = 0 and the other for WHERE is_deleted = 1; see the sketch below.
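A rough sketch of that second suggestion against the question's temp table (keying on dt is my assumption; choose keys from your actual predicates):

-- One filtered index per is_deleted value, instead of keying the index
-- on the low-selectivity bit column itself.
CREATE INDEX [flt_not_deleted_example]
ON #test_index (dt DESC)
INCLUDE (val)
WHERE is_deleted = 0;

CREATE INDEX [flt_deleted_example]
ON #test_index (dt DESC)
INCLUDE (val)
WHERE is_deleted = 1;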

Fundamentally this looks more like a coding exercise designed to test a hypothesis rather than a real problem / solution, but those two patterns are definitely something I look for in code reviews.

It looks like you're simply using the PK to make an alternate, smaller index. Hence, performance on it is faster.

You see this at companies that have massive data tables (e.g., master data tables). Someone decides to have one massive clustered index on it, expecting it to fill the needs of various reporting groups.

But one group may need only a few parts of that index while another group needs other parts, so an index that just slaps in every column under the sun to "optimize performance" doesn't really help.

Meanwhile, breaking it down to create multiple, smaller, targeted indices, often solves the problem.

And, that seems to be what you're doing. You have this massive clustered index with awful performance, then you're using PK to create another index with fewer columns that (no surprise) has better performance.

So, just do an analysis and figure out if you can take the single clustered index and break it down into smaller, targeted indices that specific jobs need.
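As a hedged illustration of that decomposition, using the question's fact table (which columns each job actually needs is an assumption here):

-- Keep the existing narrow clustered index, then add small targeted
-- nonclustered indexes per access pattern rather than one wide catch-all.
CREATE INDEX [ix_fact_by_group]
ON dbo.better_fact_table (group_id, date_int)
INCLUDE (group_entity_id)
ON ps_date_int(date_int);

CREATE INDEX [ix_fact_by_entity]
ON dbo.better_fact_table (group_entity_id, date_int)
ON ps_date_int(date_int);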

You would have to analyze performance then from a "single index vs. multiple index" stand-point, because there's overhead in making and updating indices. But, you have to analyze this from an overall perspective.

E.g., it may be less resource-intensive to have one massive clustered index, and more resource-intensive to have several smaller targeted indices. But if you're then able to run targeted queries at the back-end much quicker, saving time (and money) there, it might be worth it.

So you'd have to do an end-to-end analysis: not just look at how it impacts your own world, but also how it impacts end-users.

I just feel like you're misusing the PK identifier. Perhaps you're using a database system that only allows one index, and you can sneak another in via the PK (because every relational database system these days seems to index the PK automatically). However, most modern RDBMSs allow you to create as many indexes as you like (as opposed to the limit of one PK).

So, by making a PK that just acts like an alternate index, you're using up your PK, which may be needed if the table's role later expands.

That's not to say your table doesn't need a PK. Standard database practice says "every table should have a PK". But in a data-warehousing situation, a PK on a table may just be extra overhead that you don't need. Or it could be a godsend that keeps you from double-adding duplicate entries. It's really a matter of what you're doing and why you're doing it.

Massive tables definitely benefit from indexes, but don't assume a single massive clustered index will be best. It may be, but I'd recommend testing in a test environment, breaking the index apart into multiple smaller indexes targeting specific use-case scenarios.

Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange