Question

Thanks to the wonderful article "The Cost of GUIDs as Primary Keys", we have the COMB GUID. Current implementations take one of two approaches:

  1. Use the last 6 bytes for the timestamp: "GUIDs as fast primary keys under multiple databases"
  2. Use the last 8 bytes for the timestamp, based on Windows ticks: "GUID COMB strategy in EF4.1 (CodeFirst)"
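
For reference, the first approach is often written in T-SQL roughly as follows. This is only a sketch of the commonly quoted form (10 random bytes followed by 6 bytes taken from the current datetime), not the exact code of either linked article, and the variable name is illustrative.

```sql
-- Sketch of a COMB GUID: 10 random bytes from NEWID(), then 6 bytes
-- derived from the current datetime.
-- SQL Server compares uniqueidentifier values starting with the last
-- 6 bytes, which is why the timestamp is placed at the end.
DECLARE @Comb uniqueidentifier =
    CAST(CAST(NEWID() AS binary(10)) + CAST(GETDATE() AS binary(6)) AS uniqueidentifier);

SELECT @Comb AS CombGuid;
```

Because GETDATE() has a resolution of a few milliseconds, several GUIDs produced in a tight loop will share the same trailing bytes and will be ordered among themselves only by their random part.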

With a 6-byte timestamp, more bytes are left for random data, which reduces the chance of GUID collisions. However, more GUIDs will share the same timestamp, and those are not sequential relative to each other. From that angle, the 8-byte timestamp would be preferred.

So it seems like a hard choice. The article mentioned above, "GUIDs as fast primary keys under multiple databases", says:

Before we continue, a short footnote about this approach: using a 1-millisecond-resolution timestamp means that GUIDs generated very close together might have the same timestamp value, and so will not be sequential. This might be a common occurrence for some applications, and in fact I experimented with some alternate approaches, such as using a higher-resolution timer such as System.Diagnostics.Stopwatch, or combining the timestamp with a "counter" that would guarantee the sequence continued until the timestamp updated. However, during testing I found that this made no discernible difference at all, even when dozens or even hundreds of GUIDs were being generated within the same one-millisecond window. This is consistent with what Jimmy Nilsson encountered during his testing with COMBs as well

I wonder whether someone who knows database internals could shed some light on this observation. Is it because the database server keeps the data in memory and only writes it to disk when it reaches a certain threshold? If so, the reordering of inserted rows whose non-sequential GUIDs share a timestamp would mostly happen in memory, with minimal performance penalty.

Update: In our testing, the COMB GUID did not reduce table fragmentation compared with a random GUID, contrary to what is claimed around the internet. It seems the only option right now is to have SQL Server generate the sequential GUID.
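
One way to compare fragmentation between key strategies is to query the index physical stats DMV after a bulk of inserts. The sketch below assumes a table named dbo.Customer as a placeholder; substitute the table under test.

```sql
-- Report fragmentation for every index on the table under test.
-- dbo.Customer is a placeholder name; 'LIMITED' is the cheapest scan mode.
SELECT i.name AS IndexName,
       ps.index_type_desc,
       ps.avg_fragmentation_in_percent,
       ps.page_count
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID(N'dbo.Customer'), NULL, NULL, 'LIMITED') AS ps
JOIN sys.indexes AS i
  ON i.object_id = ps.object_id
 AND i.index_id  = ps.index_id
ORDER BY ps.avg_fragmentation_in_percent DESC;
```

Running it once per strategy (random GUID, COMB, server-generated sequential GUID) against otherwise identical tables gives comparable numbers.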

Solution

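The article you referenced is from 2002 and is very old. Just use newsequentialid (available in SQL Server 2005 and up). Each id it generates is greater than any id it previously generated on that computer since Windows was started, which solves the index fragmentation/page split issue. (After a restart the sequence can begin again from a lower range, though the values remain globally unique.)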
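
A small way to see the ordering for yourself, assuming a throwaway table (the table, constraint, and column names here are hypothetical):

```sql
-- Hypothetical demo table; names are illustrative only.
CREATE TABLE dbo.SeqGuidDemo (
    Id   uniqueidentifier NOT NULL
         CONSTRAINT DF_SeqGuidDemo_Id DEFAULT (NEWSEQUENTIALID())
         CONSTRAINT PK_SeqGuidDemo PRIMARY KEY CLUSTERED,
    Note varchar(50) NOT NULL
);

-- Separate statements so each row gets its GUID in a known order.
INSERT dbo.SeqGuidDemo (Note) VALUES ('first');
INSERT dbo.SeqGuidDemo (Note) VALUES ('second');
INSERT dbo.SeqGuidDemo (Note) VALUES ('third');

-- Sorting by Id should list the rows in insertion order.
SELECT Id, Note FROM dbo.SeqGuidDemo ORDER BY Id;

DROP TABLE dbo.SeqGuidDemo;
```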

Another aspect I'd like to mention, though, that the writer of that article glossed over, is that using 16 bytes when you only need 4 is not a good idea. Let's say you have a table with 500,000 rows averaging 150 bytes not including the clustered column, and the table has 3 nonclustered indexes (which repeat the clustered column in each row), each in turn with rows averaging 4 bytes, 25 bytes, and 50 bytes not counting the clustered column.

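The storage requirements at a perfect 100% fill factor are then as follows (all figures in megabytes, except the percentage row; "NC" columns are the nonclustered indexes, labelled by their row size before the clustered key is added):

Item   Clustered  NC 50 B  NC 25 B  NC 4 B  Total
-----  ---------  -------  -------  ------  ------
GUID   79.1       31.5     19.6     9.5     139.7
int    73.4       25.7     13.8     3.8     116.7
% imp  7.2%       18.4%    29.6%    60.0%   16.5%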

In the nonclustered index having just one int column of 4 bytes (a common scenario), switching the clustered index to an int makes it 60% smaller! That means roughly 60% fewer pages to read for any scan of that index, and that's conservative, because with smaller rows, page splits will occur less often and fragmentation will stay low for longer.

Even in the clustered index itself, there's still a 7.2% performance improvement, which is not nothing, at all.
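
As a rough check, the sizes can be approximated as rows x (row bytes + key bytes), with 1 MB taken as 1,048,576 bytes. That formula is an assumption on my part for illustration; it ignores page and row overhead, so expect the results to agree with the figures above only to within about 0.1 MB.

```sql
-- Approximate index sizes for a 16-byte GUID key versus a 4-byte int key.
-- Assumes size = rows * (row bytes + key bytes); overhead is ignored.
DECLARE @Rows int = 500000;

SELECT v.IndexName,
       CAST(@Rows * (v.RowBytes + 16) / 1048576.0 AS decimal(9,1)) AS GuidMB,
       CAST(@Rows * (v.RowBytes + 4)  / 1048576.0 AS decimal(9,1)) AS IntMB
FROM (VALUES ('Clustered',   150),
             ('NC 50 bytes',  50),
             ('NC 25 bytes',  25),
             ('NC 4 bytes',    4)) AS v (IndexName, RowBytes);
```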

What if you used GUIDs throughout a 1.397-terabyte database whose tables have a similar profile, so that switching to int would yield a 16.5% reduction in size? Your whole database would be 230 GB larger than it needs to be (scale up the Total column: 139.7 - 116.7 = 23.0). That translates into real money in the real world for high-availability storage. It moves your disk purchase schedule earlier in time, which is harmful to your company's bottom line.

Do not use larger data types than necessary, ever. It's like adding weight to your car for no reason: you will pay for it (if not in speed, then in fuel economy).

UPDATE

Now that I know you are creating the GUID in your client-side code, I can see more clearly the nature of your problem. If you are able to defer creating the GUID until row insertion time, here's one way to accomplish that.

First, set a default for your CustomerID column:

ALTER TABLE dbo.Customer ADD CONSTRAINT DF_Customer_CustomerID
   DEFAULT (newsequentialid()) FOR CustomerID;

Now you don't have to specify what value to insert for CustomerID in any INSERT, and your query could look like this:

DECLARE @Name varchar(100) = 'Acme Spy Devices';
INSERT dbo.Customer (Name)
OUTPUT inserted.CustomerID -- a GUID
VALUES (@Name);

In this very simple example, you have inserted a new row into the Customer table and returned a rowset to the client containing the just-created value, all in one query.
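
If you need the new value on the server side rather than as a rowset returned to the client, a variation is to send the OUTPUT into a table variable. This is a sketch using the same dbo.Customer table; the @NewIds variable name is illustrative.

```sql
-- Capture the generated GUID in a table variable instead of returning it directly.
DECLARE @NewIds TABLE (CustomerID uniqueidentifier NOT NULL);
DECLARE @Name varchar(100) = 'Acme Spy Devices';

INSERT dbo.Customer (Name)
OUTPUT inserted.CustomerID INTO @NewIds (CustomerID)
VALUES (@Name);

-- Use the captured value in later statements, or return it to the caller.
SELECT CustomerID FROM @NewIds;
```

The INTO form is also the one to use when the table has enabled triggers, since SQL Server rejects an OUTPUT clause without INTO in that case.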

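Note that you cannot call the function directly, as in VALUES (newsequentialid(), @Name): SQL Server only allows newsequentialid() in a DEFAULT constraint on a uniqueidentifier column. If you want to list the column explicitly, use INSERT dbo.Customer (CustomerID, Name) VALUES (DEFAULT, @Name) instead.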

Licensed under: CC-BY-SA with attribution