25k Inserts Daily, 99% Fragmentation on Clustered GUID Index

https://stackoverflow.com/questions/8896638

17-04-2021
|

Question

I have a table that has a primary key as a clustered GUID field; I'm generating GUIDs using NEWSEQUENTIALID() instead of NEWID. Unfortunately, because this table sees ~25k-100k inserts per day, within a few hours the (default: clustered) primary key index becomes 99% fragmented.

I originally used NEWID instead of generating sequential IDs, but even when I recreated the table and reinserted all the rows using NEWSEQUENTIALID (and specified that as the default value for the primary key column), I still see fragmentation in the order of 99% within a few hours. (The table currently has about 1.3 million records in it.

I had thought about replacing the GUID with an integer primary key, but I'm not sure if that will work; plus, since our team uses GUIDs for primary keys instead of integers going forward, I don't think I'll have enough buy-in to do this.

What are my options to keep this thing defragmented? I'm using SQL Server Express, so I don't have access to SQL Agent (and cannot, therefore, run a maintenance plan regularly to rebuild the index).

I also may very likely be splitting this database/table at some point in the future (because of the volume of data), so I will likely need GUIDs to merge the tables.

Also: I cannot use an indexed view, because I have an inner-select which would be difficult for me to unwind into a join.

Solution

In my own personal experience, tossing out GUIDs as your clustering key can have major, positive effects on your system - especially on index fragmentation!

My new INT IDENTITY clustering indices have hardly any fragmentation - even after months of intense, daily production use. Definitely worth it!!

Using the Guid datatype as clustering key in SQL Server is a horribly bad choice - whichever way you look at it...

See some of Kimberly Tripp's (Queen of Indexing) blog post on the topic:

and anything else she's blogged on the topic of clustering keys....

OTHER TIPS

This is expected behavior for Guid indexes with large numbers of inserts. Most of the time you choose guids as keys only because the records are generated by multiple sources and you need to make individual sources don't step on each other's toes. An example here would be offline mobile devices. A worker in the field needs to create a new record when not connected, and so the mobile device can safely create the record with a guid as the key. When back online later on, the device can sync with the database safely without worrying about any key collisions.

If you're generating the guids on a single server, you're normally better off using a simple identity column. If you really want the guids, you can still include them... you just might want to think twice about using them for your clustered index. The only reason you might want to cluster on guid is if later on you'll come back to the table and query for one record at a time based on it's guid. That seems unlikely with the insert rate you're seeing. But, if it's the case, you can help mitigate things by reducing the fill factor on the index. This will increase the amount of disk space used (and mean more disk seeks later on), but pages will fill up faster and you'll avoid some of the index re-shuffling.

Another option to look at here if an integer-based key is out of the question is a sequential guid. This still provides uniqueness, while also creating less fragmentation.

I fully realize that I'm necro'ing a thread that was started over 8 years ago (at the time of this writing) but there's some serious misconceptions about NEWID(), NEWSEQUENTIALID(), "Ever-increasing-INTs", and a thing that I simply refer to as "ExpAnsive Updates" (with an "A"), which are really ExpEnsive (with an "E").

Let's cover the latter first, which is probably the real problem the OP is having...

With only a tiny difference, which doesn't matter much when it comes to the unwanted creation of page splits and the resulting fragmentation, NEWSEQUENTIALID and "Ever-increasing INTs" both work the same way... by themselves, they only create "good" page splits (which are also "bad" but that's a subject for a different discussion). So, with reference to the originally posted question where the Op stated that switching from the completely random NEWID to the "ever-increasing" NEWSEQUENTIALID seemed to make no difference in the amount of fragmentation that was being created.

The reason for that isn't the fact that NEWSEQUENTIALID has a problem (it doesn't). The fragmentation problem is most likely that the new rows are being inserted (which will cause NO fragmentation with NEWSEQUENTIALID) and then those new rows suffer another process to update them. If the updates are "ExpAnsive" where some variable width column in a row becomes wider, then that will cause massive page splits. This will happen even if you build the index with a fairly low FILL FACTOR because INSERTS DON'T STOP INSERTING INTO PAGES JUST BECAUSE THEY REACH THE FILL FACTOR. Instead, a large number of inserts will insert into pages until they're nearly 100% full (depending on the row count per page which depends on the width of the rows being inserted) and then create a new page using a "Good" page split with virtually no fragmentation as surely as if you were using an ever-increasing integer.

So you insert all of these rows in contiguous pages and they're filled to as close to 100% as they can get. Everything is fine... no fragmentation. But then you do "post insert processing" that updates the rows you just inserted. If the size of the rows are increased due to "ExpAnsive" then KAAAA-BOOOOOM!!! All of those totally full pages all end up splitting.

One of the most common sources of such expansion is when people use "poor man's auditing" and they have a "Modified_BY" column that goes from NULL to some value. There are many ways to fix that particular problem but, again, well beyond the scope of this thread and post.

Shifting gears to random GUIDs generated by NEWID()... there are a whole lot of reasons to not use them but, totally contrary to what you've been made to believe, fragmentation actually isn't one of them. I've given several presentations in a very "Alice's Restaurant Fashion" (lots of graphics and notations on the graphics) that prove it. To make a more than 1 hour presentation suitable for this post, I'll tell you that it all boils down to several small but deadly mistakes that people keep making...

They keep using REORGANIZE because it's supposedly a "Best Practice" is the main problem. They don't realize that REORGANIZE doesn't actually work on GUIDs they way they think it might. Instead of providing extra space on the pages, it actually removes the extra space and, THAT, my fellow index wranglers, actually PERPETUATES the fragmentation of GUIDs. YOU MUST NOT USE REORGANZE WHEN DOING INDEX MAINTENANCE ON RANDOM GUIDS! PERIOD!!! Not even if you're using the Express or Standard Editions. If you don't have the time, resources, or disk space to REBUILD them, it's actually better to not do any index maintenance on random GUIDs than it is to do it wrong by using REORGANZE. Wait until you can do a REBUILD.
You MUST set a lower FILL FACTOR on random GUID keyed indexes. Leaving them at "0" is almost as bad as REORGANIZEing them. Depending, of course, on how wide the rows of the index are, how many are inserted per day, and how long you want to go with absolutely ZERO page splits (not even supposed "good" ones!!!) on the random GUID indexes, I tell people to set their FILL FACTOR to 71, 81, or 91. The reason why I make all of those end with a "1" is because of the final thing you need to fix for random GUIDs when "ExpAnsive" updates aren't present, which is item #3 below.
You MUST check indexes that are based on random GUIDs every bloody night. The reason I chose to give them all FILL FACTORs ending in "1" is because THAT is what you're looking for as a % of logical fragmentation. Just as soon as they go over the 1% mark, you MUST REBUILD them because nearly every page in the entire index is at the point where it's going to split. (I call these "Low Threshold Rebuilds"). Now, don't get confused. If everything is setup correctly and there are no "ExpAnsive" updates, your GUID keyed Clustered Indexes can go weeks with NO PAGE SPLITs or the related fragmentation and your much more narrow non-clustered indexes can quite literally go for MONTHS with absolutely no fragmentation!

The other big mistake is, of course, "ExpAnsive" updates. Those will kill just about everything but, surprisingly, random GUIDs will actually weather such an onslaughyt a whole lot better than most anything else using the same steps as above.

What you really need to do is fix the "ExpAnsive" updates so that they are no longer "ExpAnsive". Like I said, that's a whole 'nuther subject that's way to long for this post.

have a look at this simple query in the stackexchange data explorer. It looks like newsequentialid() increments the most significant portion of the guid, rather than the least. This would be a likely cause of the fragmentation you are seeing.

If you must use guids, perhaps look into generating them via code and sending them through in your insert statements, rather than relying on the database generating them. Use the "comb" technique, to use the current timestamp as part of the guid, incrementing in the least significant digits.

edit

.. or if you don't want to generate them in code, you could do something like this within the database

CAST(CAST(NEWSEQUENTIALID() AS BINARY(10)) + CAST(GETDATE() AS BINARY(6)) AS UNIQUEIDENTIFIER)

as your default value, as per this modification to the above query

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow