Question

I have a database table with 5 million rows. The clustered index is auto-increment identity column. There PK is a code generated 256 byte VARCHAR which is a SHA256 hash of a URL, this is a non-clustered index on the table.

The table is as follows:

CREATE TABLE [dbo].[store_image](
    [imageSHAID] [nvarchar](256) NOT NULL,
    [imageGUID] [uniqueidentifier] NOT NULL,
    [imageURL] [nvarchar](2000) NOT NULL,
    [showCount] [bigint] NOT NULL,
    [imageURLIndex]  AS (CONVERT([nvarchar](450),[imageURL],(0))),
    [autoIncID] [bigint] IDENTITY(1,1) NOT NULL,
 CONSTRAINT [PK_imageSHAID] PRIMARY KEY NONCLUSTERED 
(
    [imageSHAID] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]

GO

CREATE CLUSTERED INDEX [autoIncPK] ON [dbo].[store_image] 
(
    [autoIncID] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
GO
  • imageSHAID is a SHA256 hash of an image URL e.g. "http://blah.com/image1.jpg", it is hashed into a varchar of 256 length.

  • imageGUID is a code generated guid in which I identify the image (it will be used as an index later, but for now I have omitted this column as an index)

  • imageURL is the full URL of the image (up to 2000 characters)

  • showCount is the number of times the image is shown, this is incremented each time this particular image is shown.

  • imageURLIndex is a computed column limited by 450 characters, this allows me to do text searches on the imageURL should I choose to, it is indexable (again index is omitted for brevity)

  • autoIncID is the clustered index, should allow faster inserting of data.

Periodically I merge from a temp table into the store_image table. The temp table structure is as follows (very similar to the store_image table):

CREATE TABLE [dbo].[store_image_temp](
    [imageSHAID] [nvarchar](256) NULL,
    [imageURL] [nvarchar](2000) NULL,
    [showCount] [bigint] NULL,
) ON [PRIMARY]

GO

When the merge process is run, I write a DataTable to the temp table using the following code:

using (SqlBulkCopy bulk = new SqlBulkCopy(storeConn, SqlBulkCopyOptions.KeepIdentity | SqlBulkCopyOptions.KeepNulls, null))
{
    bulk.DestinationTableName = "[dbo].[store_image_temp]";
    bulk.WriteToServer(imageTableUpsetDataTable);
}

I then run the merge command to update the showCount in the store_image table by merging from the temp table based on the imageSHAID. If the image doesn't currently exist in the store_image table, I create it:

merge into store_image as Target using [dbo].[store_image_temp] as Source
on Target.imageSHAID=Source.imageSHAID 
when matched then update set 
Target.showCount=Target.showCount+Source.showCount 
when not matched then insert values (Source.imageSHAID,NEWID(), Source.imageURL, Source.showCount);

I'm typically trying to merge 2k-5k rows from the temp table to the store_image table at any one merge process.

I used to run this DB on a SSD (only SATA 1 connected) and it was very fast (under 200 ms). I ran out of room on the SSD so I swapped the DB to a 1TB 7200 cache spinning disk, since then completion times are over 6-100 seconds (6000 - 100000MS). When the bulk insert is running I can see disk activity of around 1MB-2MB/sec, low CPU usage.

Is this a typical write time for this amount of data? It seems a little slow to me, what is causing the slow performance? Surely with the imageSHAID being indexed we should expect quicker seek times than this?

Any help would be appreciated.

Thanks for your time.

Was it helpful?

Solution

Your UPDATE clause in the MERGE updates showCount. This requires a key lookup on the clustered index.

However, the clustered index is also declared non-unique. This gives information to the optimiser even though the underlying column is unique.

So, I'd make these changes

  • the clustered primary key to be autoIncID
  • the current PK on imageSHAID to be a standalone unique index (not constraint) and add an INCLUDE for showCount. Unique constraints can't have INCLUDEs

More observations:

  • you don't need nvarchar for the hash or URL columns. These are not unicode.
  • A hash is also fixed length so can be char(64) (for SHA2-512).
  • The length of a column defines how much memory to assign to the query. See this for more: is there an advantage to varchar(500) over varchar(8000)?
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top