Question

I have a database with 30 million rows. The PK clustered index is a code generated GUID.

The table is as follows:

CREATE TABLE [dbo].[events](
    [imageEventGUID] [uniqueidentifier] NOT NULL,
    [imageSHAID] [nvarchar](256) NOT NULL,
    [queryGUID] [uniqueidentifier] NOT NULL,
    [eventType] [int] NOT NULL,
    [eventValue] [nvarchar](2050) NULL,
    [dateOfEvent] [datetime] NOT NULL,
 CONSTRAINT [PK_store_image_event] PRIMARY KEY CLUSTERED 
(
    [imageEventGUID] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]

GO

Put simply, it's an image search engine.

  • imageEventGUID is a code-generated unique identifier
  • imageSHAID is the SHA256 of the image URL
  • queryGUID is a code-generated FK (excluded from the create statement for brevity)
  • eventType is a number assigned to what type of event it is
  • eventValue is usually a URI of the image e.g. "http://mywebpage.com/images/image123456789.jpg"

Periodically I insert via SqlBulkCopy (from a DataTable) into this table using pretty standard code:

using (SqlBulkCopy bulk = new SqlBulkCopy(storeConn, SqlBulkCopyOptions.KeepIdentity | SqlBulkCopyOptions.KeepNulls, null))
{
    bulk.DestinationTableName = "[dbo].[events]";
    bulk.WriteToServer(myeventsDataTable);
}

I'm typically trying to insert between 5k and 10k rows in one bulk insert, and I'm getting terrible insert performance from this bulk copy. I used to run this DB on an SSD (connected only via SATA 1) and it was very fast (under 500 ms). I ran out of room on the SSD, so I moved the DB to a 1TB 7200 RPM spinning disk; since then, completion times are over 120 seconds (120,000 ms). While the bulk insert is running I can see disk activity of around 1MB/sec and low CPU usage.

I have no other indexes on this table apart from the PK.

My questions to you are:

Can you see anything obvious that I am doing wrong which would cause this?

Is it just a case of 'your spinning disk is just not fast enough for a DB this size'?

What exactly is happening when this data is inserted? Because this is the clustered index, is it rearranging data pages on disk when an insert is made? It is trying to insert GUIDs, which by nature are unordered, so is it possible that this 'random insert nature' is causing the read/write head to move around a lot to different pages on the disk?

Thanks for your time.


Solution

My guess is that the main issue is your choice of clustered index. The clustered index determines the physical order of records in the table. Since your PK is a GUID (which I'm assuming is generated randomly rather than sequentially), the database has to insert each row in its proper location, which will likely be between two existing records, and that may cause page splits, fragmentation, etc.

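As for why it's faster on an SSD than on a magnetic drive, I'm no expert, but the likely reason is that randomly placed inserts touch pages scattered all over the table, and an SSD has no seek penalty for that kind of access, while a spinning disk has to move its head for each one. Raw I/O throughput will be faster on the SSD too, but not by that magnitude.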

If you can use a numeric autoincrement primary key instead of a GUID, then bulk inserts should be MUCH faster. You can still create unique indices on the GUID column to make queries faster.

OTHER TIPS

Try using a default constraint with newsequentialid() on the imageEventGUID column.

It will generate the GUIDs in increasing order, so SQL Server won't have to rearrange the table on each insert.
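
A minimal sketch of that default constraint, assuming the table from the question. The constraint name DF_events_imageEventGUID is made up for illustration. A default is only applied when the insert leaves the column out, so the client code would have to stop supplying its own GUID for this to have any effect.

```sql
-- Assumption: DF_events_imageEventGUID is a hypothetical constraint name.
-- NEWSEQUENTIALID() is only allowed inside a DEFAULT constraint
-- on a uniqueidentifier column.
ALTER TABLE [dbo].[events]
    ADD CONSTRAINT [DF_events_imageEventGUID]
    DEFAULT NEWSEQUENTIALID() FOR [imageEventGUID];
GO
```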

GUID as a clustered primary key in itself is a horribly bad design choice - see Kim Tripp's blog post "GUIDs as PRIMARY KEYs and/or the clustering key" for explanations. Using a random (client-side generated) GUID will lead to very high (often 99% or more) fragmentation, and in the process of bulk inserting a lot of rows, it will cause tons of page splits which are very expensive operations.

If you can't change that, you can at least make sure that the clustered index, which will have horrible fragmentation values, is rebuilt every night, or even more frequently if you can afford to.
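
As a rough sketch of what that maintenance could look like (assuming the table and index names from the question), the first statement reports the current fragmentation of the clustered index and the second rebuilds it. How it gets scheduled, for example as a nightly job, is left open here.

```sql
-- Report fragmentation of the clustered index (index_id = 1) on dbo.events.
SELECT ips.avg_fragmentation_in_percent,
       ips.page_count
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID(N'dbo.events'), 1, NULL, 'LIMITED') AS ips;
GO

-- Rebuild the clustered primary key index from the question.
ALTER INDEX [PK_store_image_event] ON [dbo].[events] REBUILD;
GO
```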

And you could also keep your GUID column as the (non-clustered) primary key and introduce a new INT IDENTITY column to be used as the clustering key. That alone would already help quite a bit, I'm sure, by eliminating the outrageous fragmentation that the very random GUIDs will cause on your clustered index.
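
One way this change might be scripted against the existing table is sketched below. The column name eventID and the index name CIX_events_eventID are assumptions for illustration. On a 30 million row table each step rewrites a lot of data, so treat it as an outline to try on a copy first rather than a ready-made migration.

```sql
-- 1. Drop the clustered primary key. Any foreign keys that reference it
--    would have to be dropped first. The table becomes a heap.
ALTER TABLE [dbo].[events] DROP CONSTRAINT [PK_store_image_event];
GO

-- 2. Add an ever-increasing surrogate column (hypothetical name: eventID).
ALTER TABLE [dbo].[events] ADD [eventID] INT IDENTITY(1,1) NOT NULL;
GO

-- 3. Cluster the table on the new column (hypothetical index name).
CREATE UNIQUE CLUSTERED INDEX [CIX_events_eventID]
    ON [dbo].[events] ([eventID]);
GO

-- 4. Re-create the primary key on the GUID, this time nonclustered.
ALTER TABLE [dbo].[events]
    ADD CONSTRAINT [PK_store_image_event]
    PRIMARY KEY NONCLUSTERED ([imageEventGUID]);
GO
```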

You can disable other indexes, but not a clustered PK.
Well, you can disable a clustered PK, but that makes the table inaccessible.
If the data is not loaded in PK order, you will get rapid index fragmentation.
As fragmentation increases, insert speed decreases.

I understand you cannot control the GUID.

But there are a few options.

Use a fill factor on [PK_store_image_event] of something like 50, 20, or 10.
This leaves space for inserts, but at the cost of a larger index size on disk.
Periodically rebuild the index, at minimum nightly.
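
A minimal sketch of the fill factor option, using the index name from the question and 50 as an example value. The fill factor is only applied when the index is created or rebuilt, which is why it goes hand in hand with the periodic rebuild.

```sql
-- Rebuild the clustered PK, filling each leaf page to about 50 percent
-- so later inserts have free space before a page has to split.
ALTER INDEX [PK_store_image_event] ON [dbo].[events]
    REBUILD WITH (FILLFACTOR = 50);
GO
```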

Can you sort the data prior to the load?
If so load sorted by the PK.
If you have the data in a DataTable then you can sort it.
You won't be able to use your existing load code as-is, but you can sort it.
A TVP (table-valued parameter) is an option.
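
A server-side variation on the same idea (not the TVP approach itself) is sketched below: bulk copy into an unindexed staging table, then move the rows across ordered by the key. The table events_staging is hypothetical. One reason to sort on the server is that SQL Server compares uniqueidentifier values differently from .NET's Guid, so a client-side sort on a Guid column may not match the index order. The ORDER BY here expresses the intended order; the engine still chooses the actual insert plan.

```sql
-- Hypothetical staging heap: same columns as dbo.events, no indexes.
CREATE TABLE [dbo].[events_staging](
    [imageEventGUID] [uniqueidentifier] NOT NULL,
    [imageSHAID] [nvarchar](256) NOT NULL,
    [queryGUID] [uniqueidentifier] NOT NULL,
    [eventType] [int] NOT NULL,
    [eventValue] [nvarchar](2050) NULL,
    [dateOfEvent] [datetime] NOT NULL
);
GO

-- After the bulk copy has filled the staging table,
-- move the batch into the real table in clustered key order.
INSERT INTO [dbo].[events]
    ([imageEventGUID], [imageSHAID], [queryGUID],
     [eventType], [eventValue], [dateOfEvent])
SELECT [imageEventGUID], [imageSHAID], [queryGUID],
       [eventType], [eventValue], [dateOfEvent]
FROM [dbo].[events_staging]
ORDER BY [imageEventGUID];

-- Empty the staging table for the next batch.
TRUNCATE TABLE [dbo].[events_staging];
GO
```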

Use an identity column for the PK and a unique index on [imageEventGUID].
If it has a unique index, it can still be referenced by an FK.
Disable that index, load, then rebuild.
The rebuild will fail if you have a duplicate.
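
The disable, load, rebuild cycle could look roughly like this, assuming the table has already been switched to an identity clustered PK. The index name UX_events_imageEventGUID is made up for illustration.

```sql
-- One-time setup: unique index on the GUID column (hypothetical name).
CREATE UNIQUE NONCLUSTERED INDEX [UX_events_imageEventGUID]
    ON [dbo].[events] ([imageEventGUID]);
GO

-- Before each load: stop maintaining the index.
-- While it is disabled, uniqueness of the GUID is not enforced.
ALTER INDEX [UX_events_imageEventGUID] ON [dbo].[events] DISABLE;
GO

-- ... run the bulk insert here ...

-- After the load: rebuilding re-enables the index.
-- This fails if the load introduced a duplicate GUID.
ALTER INDEX [UX_events_imageEventGUID] ON [dbo].[events] REBUILD;
GO
```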

Or, as a variation of the above, just skip the identity PK.

Licensed under: CC-BY-SA with attribution