Question

Using Azure SQL DW, I have created a secondary index on a single column in a table, yet I'm unsure if the index is ever being used by my query. The performance is still slow, but I'm searching about 7 billion rows of data.

My table is essentially:

CREATE TABLE FactBusinessEvent
(
    [EmailAddress] [nvarchar](200) NOT NULL,
    [EventDate] [datetime] NOT NULL,
    [EventDate_key] [int] NOT NULL,
   -- OTHER COLUMNS HERE
)
WITH
(
    DISTRIBUTION = HASH ( [EmailAddress] ),
    CLUSTERED COLUMNSTORE INDEX
);

CREATE INDEX IX_FactBusinessEvent_EmailAddress ON FactBusinessEvent
(
   EmailAddress ASC
);

And my query is:

SELECT * FROM FactBusinessEvent WHERE EmailAddress = 'test@test.com'

Using SSMS 17.6, I can show the estimated query plan and it completely ignores the secondary index, showing a single Get from the table. I can't seem to use hints in SQL DW, so is there anything else to try?

Thanks for any insight.


Solution

Because you have chosen to hash distribute your table on EmailAddress, all rows with the same email address get the same hash and consequently land in the same distribution (SQL DW always has 60 distributions, spread across a number of compute nodes). A lookup on a single email address is therefore served by just one of those distributions, so you won't be making best use of the compute available to you.
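
To see how the rows are actually spread across the 60 distributions, one option is DBCC PDW_SHOWSPACEUSED. This is a minimal sketch, assuming the table lives in the dbo schema (the question does not say which schema is used):

```sql
-- Returns one row per distribution, including ROWS, DATA_SPACE and DISTRIBUTION_ID.
-- Large differences in ROWS between distributions indicate data skew.
-- Assumption: the table is in the dbo schema.
DBCC PDW_SHOWSPACEUSED('dbo.FactBusinessEvent');
```

If a handful of distributions hold far more rows than the rest, a few very common email addresses are probably dominating the hash.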

Having said that, can you confirm what DWU you are running at, what resource class is associated with the user you are running as, and that you have created the relevant statistics (i.e. on EmailAddress)?
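
As a rough sketch, those three things could be checked along these lines. The statistics name is made up for illustration, the dbo schema is assumed, and the first query is expected to be run while connected to the master database of the logical server:

```sql
-- 1. Service objective (DWU). Assumption: run this while connected to master.
SELECT db.name AS database_name, ds.edition, ds.service_objective
FROM sys.database_service_objectives AS ds
JOIN sys.databases AS db ON ds.database_id = db.database_id;

-- 2. Members of the larger dynamic resource classes.
--    A user that does not appear here is running in the default smallrc.
SELECT r.name AS role_principal_name, m.name AS member_principal_name
FROM sys.database_role_members AS rm
JOIN sys.database_principals AS r ON rm.role_principal_id = r.principal_id
JOIN sys.database_principals AS m ON rm.member_principal_id = m.principal_id
WHERE r.name IN ('mediumrc', 'largerc', 'xlargerc');

-- 3. Single-column statistics on the filter column.
--    The statistics name is hypothetical; any unique name will do.
CREATE STATISTICS _st_FactBusinessEvent_Email ON dbo.FactBusinessEvent ( EmailAddress );
```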

Looking at your secondary index, it only contains one column, so it is best suited to queries that reference only that column, or to small point-lookups (assuming SQL DW behaves in a similar manner to SQL Server here, which isn't necessarily true). Even if it did use the index, it would then have to fetch the other columns from the main columnstore index to service your SELECT *.

Have a look at this article for advice on hash distributing large tables: https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-best-practices#hash-distribute-large-tables

If this is one of your business-critical queries, you could consider a different hashing column (is there a column you join on, for example?), or even experiment with round robin distribution. In this simple example, I create a copy of the table using ROUND_ROBIN distribution and run the query against that table:

CREATE TABLE FactBusinessEvent_rr
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT *
FROM dbo.FactBusinessEvent;
GO

-- Create the required statistics
CREATE STATISTICS _st_FactBusinessEvent_Email_rr ON dbo.FactBusinessEvent_rr ( EmailAddress );
-- other stats here, ie columns you will join on, use in WHERE clause, or aggregate on
-- ...
GO

SELECT * 
FROM dbo.FactBusinessEvent_rr 
WHERE EmailAddress = 'test@test.com'
OPTION ( LABEL = 'email round robin query' );
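
The LABEL makes the query easy to find afterwards in the DMVs, so you can compare timings and steps between the hash-distributed and round robin versions. A sketch of how that lookup might go; 'QID####' is a placeholder for the request_id returned by the first query:

```sql
-- Find the labelled request and its overall elapsed time (milliseconds).
SELECT request_id, status, submit_time, total_elapsed_time, resource_class, command
FROM sys.dm_pdw_exec_requests
WHERE [label] = 'email round robin query'
ORDER BY submit_time DESC;

-- Break the request down into its distributed steps.
-- Replace the placeholder 'QID####' with the request_id from the query above.
SELECT step_index, operation_type, location_type, status,
       total_elapsed_time, row_count, command
FROM sys.dm_pdw_request_steps
WHERE request_id = 'QID####'
ORDER BY step_index;
```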
Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange