Problem

I have several tables where records can be uniquely identified with several broad business fields. In the past, I've used these fields as a PK, with these benefits in mind:

  • Simplicity; there are no extraneous fields and just one index
  • Clustering allows for fast merge joins and range-based filters

However, I've heard a case made for creating a synthetic IDENTITY INT PK, and instead enforcing the business key with a separate UNIQUE constraint. The advantage is that the narrow PK makes for much smaller secondary indices.
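For concreteness, the two designs being compared might look like this in T-SQL (table and column names here are hypothetical, just to make the contrast visible):

```sql
-- Design 1: the broad business key is the PK (clustered by default)
CREATE TABLE dbo.OrderLines_NaturalPK (
    OrderNumber  int         NOT NULL,
    LineNumber   smallint    NOT NULL,
    ProductCode  varchar(20) NOT NULL,
    Quantity     int         NOT NULL,
    CONSTRAINT PK_OrderLines_Natural
        PRIMARY KEY CLUSTERED (OrderNumber, LineNumber)
);

-- Design 2: a narrow synthetic IDENTITY PK, with the business key
-- enforced by a separate UNIQUE constraint
CREATE TABLE dbo.OrderLines_SurrogatePK (
    OrderLineID  int IDENTITY(1,1) NOT NULL,
    OrderNumber  int         NOT NULL,
    LineNumber   smallint    NOT NULL,
    ProductCode  varchar(20) NOT NULL,
    Quantity     int         NOT NULL,
    CONSTRAINT PK_OrderLines_Surrogate
        PRIMARY KEY CLUSTERED (OrderLineID),
    CONSTRAINT UQ_OrderLines_BusinessKey
        UNIQUE (OrderNumber, LineNumber)
);
```

In the second design every non-clustered index (including the UNIQUE constraint's index) carries only the 4-byte OrderLineID as its row locator, rather than the full composite key.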

If a table has no indices other than the PK, I don't see any reason to favor the second approach, though in a large table it's probably best to assume that indices may be necessary in the future, and therefore favor the narrow synthetic PK. Am I missing any considerations?

Incidentally, I'm not arguing against using synthetic keys in data warehouses, I'm just interested in when to use a single broad PK and when to use a narrow PK plus a broad UK.


Solution

There is no significant disadvantage to using the natural key as the clustered index, provided that:

  • there are no non-clustered indexes
  • no foreign keys reference this table (i.e., it is not a parent table)

The downside would be increased page splits, since inserts would be distributed throughout the table instead of being appended at the end.

Where you do have FKs or NC indexes, using a narrow, numeric, ever-increasing clustered index has advantages: you repeat only a few bytes of data per NC index or FK entry, not the whole business/natural key.

As to why, read the top 5 articles from a Google search on the subject.

Note I avoided the use of "primary key".

You can have the clustered index on the surrogate key but keep the PK on the business key as non-clustered. Just make sure the clustered index is unique, because otherwise SQL Server will add a hidden "uniquifier" to make it so.
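A sketch of that arrangement (hypothetical names), with the clustered index explicitly declared unique so SQL Server does not need to add a uniquifier:

```sql
-- Clustered index on the surrogate, PK enforced nonclustered on the business key
CREATE TABLE dbo.Customer (
    CustomerID   int IDENTITY(1,1) NOT NULL,
    CustomerCode varchar(30)       NOT NULL,
    CONSTRAINT PK_Customer PRIMARY KEY NONCLUSTERED (CustomerCode)
);

CREATE UNIQUE CLUSTERED INDEX CIX_Customer
    ON dbo.Customer (CustomerID);
```

Here "primary key" and "clustered index" are deliberately decoupled: the business key remains the declared PK, while the narrow surrogate dictates the physical order and serves as the row locator for non-clustered indexes.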

Finally, it may make sense to have a surrogate key, but not blindly on every table: many-to-many tables do not need one, nor do tables where a compound key from the parent tables will suffice.

Other Tips

Although I risk stating the obvious, an index on a surrogate key (an id number) is useful if you need to locate things by their id number. Users are not going to deal with the id number; they're going to deal with human-readable text. That means you have to pass around both the text and its id number a lot, so the user interface can display the text while operating on the id number.

The dbms will use that kind of index to support foreign keys, if you define them that way.

You can sometimes improve performance by using id numbers as foreign keys, but it's not an absolute improvement. On our OLTP system, foreign keys using natural keys outperformed foreign keys using id numbers on a test suite of about 130 (I think) representative queries. (Because the important information is often carried in the keys, using the natural keys avoided a lot of joins.) The median speedup was a factor of 85 (joins using id numbers took 85 times longer to return rows).

Tests showed that joins on id numbers wouldn't perform faster than reads on natural keys in our database until certain tables reached many millions of rows. The width of the row has a lot to do with that--wider rows mean fewer rows fit on a page, so you have to read more pages to get 'n' rows. Almost all our tables are in 5NF; most tables are fairly narrow.

By the time joins start to outperform simple reads here, putting critical tables and indexes on a solid-state disk might push that crossover point into the hundreds of millions of rows.

I have a whole OLTP database designed using identity columns for clustering + PK. It works pretty fast on inserts/seeks, but I've seen a few problems:
1. the index fill factor option is useless, because inserts happen only at the end of the index
2. more storage space. I have tables with tens of millions of records, and one int column takes up noticeable space all by itself. Each table with an identity column for its PK also needs another index for business-key seeks, so even more storage is required.
3. scalability. This is the worst problem. Because every insert goes to the end of the index, each insert stresses only the end of the index (allocation, I/O for writes, etc.). By using a business key as the clustering key you can distribute inserts evenly across the index, which eliminates a big hotspot. You can also easily use multiple files for an index, each file on a separate drive, with each drive working independently.

I started changing my tables from identity columns to natural keys (possibly with separate keys for clustering and the PK). It just feels better now.

I would suggest the following (at least for an OLTP database):
1. use as the clustering key the right columns in the right order, so as to optimize the most frequent queries
2. use as the PK the right columns that make sense for your table

If the clustered key is not simple and contains character columns (char, varchar, nvarchar), I think the answer is "it depends"; you should analyse each case individually.

I keep the following principle: optimize for the most common query while minimizing the worst case scenario.

I almost forgot one example. I have some tables that reference themselves. If such a table has an identity column for its primary key, then inserting one row might require a follow-up update, and inserting more than one row at a time might be difficult if not impossible (it depends on the table design).
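One way this shows up, sketched with a hypothetical self-referencing Employee table: a row whose FK should point at itself cannot know its own identity value at insert time, so the insert must be followed by an update.

```sql
-- Hypothetical self-referencing table with an identity PK
CREATE TABLE dbo.Employee (
    EmployeeID int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    ManagerID  int NULL,
    Name       nvarchar(100) NOT NULL,
    CONSTRAINT FK_Employee_Manager FOREIGN KEY (ManagerID)
        REFERENCES dbo.Employee (EmployeeID)
);

-- The CEO is their own manager, but EmployeeID does not exist
-- until after the insert, so a second statement is required:
INSERT dbo.Employee (ManagerID, Name) VALUES (NULL, N'CEO');

UPDATE dbo.Employee
SET    ManagerID  = EmployeeID
WHERE  EmployeeID = SCOPE_IDENTITY();
```

With a natural key the value is known before the insert, so the self-reference can be supplied in the INSERT itself.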

From a performance point of view the choice of which key is the "primary" key makes no difference at all. There is no difference between using a PRIMARY KEY and a UNIQUE constraint to enforce your keys.
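To illustrate with a hypothetical table: both constraints below are backed by a unique index, and either one can own the clustered index; which of them carries the PRIMARY KEY label changes nothing about performance.

```sql
-- The PK label and the clustered index are independent choices:
-- here the PK's index is nonclustered and the UNIQUE constraint's
-- index is the clustered one. Swapping the labels would produce
-- the same two indexes and the same query plans.
CREATE TABLE dbo.Product (
    ProductID   int         NOT NULL,
    ProductCode varchar(20) NOT NULL,
    CONSTRAINT PK_Product PRIMARY KEY NONCLUSTERED (ProductID),
    CONSTRAINT UQ_Product UNIQUE CLUSTERED (ProductCode)
);
```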

Performance is determined by the selection and type of indexes and other storage options and by the way the keys are used in queries and code.

License: CC-BY-SA with attribution
Not affiliated with dba.stackexchange