Question

I have a skewed dataset, where most rows fall into the 10 most common values of my best candidate distribution key. My data is made up of two large tables that share only two keys: my best candidate key, plus one other that is NULL 80% of the time, so I have discounted it as an option.

Conventional wisdom says that if the data is skewed, I should use a round-robin distribution. Looking at the explain plans produced by joins on the tables, I see that my candidate column is the shuffle key for the shuffle move. This makes me question whether I should change the distribution from round robin to hash, saving the time it takes to move data on every execution.

Is my logic correct? I feel like this goes against the conventional wisdom when working with distributed SQL. I don't expect any queries where this join isn't required, so that may be where others would see the benefit.

Solution

Round robin always entails data movement for joins, because rows are placed without regard to their values, but it needn't be catastrophic for your performance. The reason it is recommended for skewed data is that, when you distribute by hash, the hash of each value deterministically selects one of the 60 distributions, so every row with the same value lands on the same distribution. In your example, most of your data would end up on one (or only a few) distribution(s), at most 10 for your 10 dominant values, and you would not be taking advantage of the compute available to you. For illustration, you might be using only 20% of the compute resources available to you while the rest sits idle.
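To make the skew argument concrete, here is a rough simulation in Python. It is only a sketch: SQL DW's actual hash function is not public, so MD5 is used as a stand-in, and the row counts, the key names and the 90% figure are made-up assumptions rather than anything measured on your data.

```python
import hashlib
from collections import Counter

NUM_DISTRIBUTIONS = 60  # number of distributions mentioned in the answer


def bucket_for(value, n=NUM_DISTRIBUTIONS):
    """Map a key to a distribution using a stable stand-in hash (MD5).

    Assumption: this is NOT the hash SQL DW uses; it only shares the
    property that equal values always map to the same distribution.
    """
    digest = hashlib.md5(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n


def hash_distribute(keys, n=NUM_DISTRIBUTIONS):
    """Rows per distribution when each row is placed by the hash of its key."""
    counts = Counter(bucket_for(k, n) for k in keys)
    return [counts.get(i, 0) for i in range(n)]


def round_robin_distribute(keys, n=NUM_DISTRIBUTIONS):
    """Rows per distribution when rows are dealt out in turn, ignoring values."""
    counts = [0] * n
    for i, _ in enumerate(keys):
        counts[i % n] += 1
    return counts


def make_skewed_keys(total_rows=100_000, hot_keys=10, hot_percent=90):
    """Hypothetical data: hot_percent of rows share only hot_keys values."""
    hot_rows = total_rows * hot_percent // 100
    keys = [f"hot-{i % hot_keys}" for i in range(hot_rows)]
    keys += [f"cold-{i}" for i in range(total_rows - hot_rows)]
    return keys


if __name__ == "__main__":
    keys = make_skewed_keys()
    hashed = hash_distribute(keys)
    dealt = round_robin_distribute(keys)
    print("largest distribution, hash:       ", max(hashed))
    print("largest distribution, round robin:", max(dealt))
    top10 = sum(sorted(hashed, reverse=True)[:10])
    print("rows held by the 10 fullest hash distributions:", top10)
```

With hash placement the 10 hot values can occupy at most 10 of the 60 distributions, so those few do most of the work. Round robin spreads the rows evenly, at the cost of a shuffle at join time.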

The requirements for a good hash column are: it should not be updateable, it should not be NULLable, and it should have a large number of distinct values that are evenly distributed.

Do you have the option of creating a concatenated key from the other columns? That could give you a more even distribution, and it would be useful as long as you used it in the joins between the two tables.
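As a rough way to judge whether a concatenated key would help, you could compare how much of the data the 10 most common values cover before and after combining columns. The Python sketch below uses invented sample rows, and the helper names `composite_key` and `top_share` are hypothetical; in practice you would run the equivalent counts against your own tables.

```python
from collections import Counter


def composite_key(*parts, sep="|"):
    """Join column values into one key; None becomes an empty string.

    The separator avoids ambiguity such as ("ab", "c") vs ("a", "bc").
    """
    return sep.join("" if p is None else str(p) for p in parts)


def top_share(keys, top_n=10):
    """Fraction of rows covered by the top_n most common key values."""
    counts = Counter(keys)
    top = sum(c for _, c in counts.most_common(top_n))
    return top / len(keys)


# Hypothetical rows: (candidate_key, other_column). Made-up data in which
# the candidate key has only 10 values but the other column has 1000.
rows = [(f"k{i % 10}", i % 1000) for i in range(10_000)]

single = [r[0] for r in rows]
combined = [composite_key(*r) for r in rows]

if __name__ == "__main__":
    print("distinct values, single key:  ", len(set(single)))
    print("distinct values, combined key:", len(set(combined)))
    print("top-10 share, single key:  ", top_share(single))
    print("top-10 share, combined key:", top_share(combined))
```

A lower top-10 share means the rows would spread over more distributions. Bear in mind that the combined key only avoids data movement if it exists as a column in both tables and the joins use it, and that a column which is mostly NULL adds little variety.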

Just some other advice: design for your key queries, and use some of the other features available in SQL DW, such as the right DWU, resource classes, non-clustered indexes and auto-statistics. Please also note that generation 2 of SQL DW is now available.

HTH

Licensed under: CC-BY-SA with attribution