I would like to have a persisted column that randomly assigns rows to 32 groups, based on one of the table's varchar key columns. My idea was:

SELECT ABS(CAST(HASHBYTES('MD5',[keyColumnFromTable]) AS bigint) % 31)

Questions:

  1. Is there a better way to do this, other than CHECKSUM (which returns different values under different collations) or writing my own function?
  2. Would there be any difference if I used SELECT CAST(HASHBYTES('MD5',[keyColumnFromTable]) AS tinyint) % 31 instead? I have read that a CAST to tinyint only considers the last byte of the data. Would that affect the randomness?

Solution

One alternative would be to use a sequence to perform round-robin allocation. Define the minimum and maximum values of the sequence to match the required number of buckets. Something like:

CREATE SEQUENCE group_sequence
    AS tinyint
    START WITH 0
    INCREMENT BY 1
    MAXVALUE 31
    CYCLE;

The column holding the group number can then have a default value of NEXT VALUE FOR group_sequence.
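As a minimal sketch of how that default could be attached, assuming the sequence above already exists: the table name dbo.RoundRobinRows, the column group_no, and the constraint name are hypothetical and only serve as illustration.

```sql
-- Hypothetical table and constraint names, for illustration only.
CREATE TABLE dbo.RoundRobinRows (
    keyColumnFromTable varchar(100) NOT NULL,
    group_no tinyint NOT NULL
        CONSTRAINT DF_RoundRobinRows_group_no
        DEFAULT (NEXT VALUE FOR group_sequence)
);

-- group_no is omitted here, so each row draws the next value from the sequence.
INSERT INTO dbo.RoundRobinRows (keyColumnFromTable)
VALUES ('alpha'), ('beta'), ('gamma');

SELECT keyColumnFromTable, group_no
FROM dbo.RoundRobinRows;
```

Note that with this approach group_no is an ordinary stored column filled at insert time, not a computed column, so the same key value can land in different groups on different inserts.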

The randomness of this approach will depend on the arrival order of new rows. If, say, you have 32 input streams and they send rows strictly one after another, each stream's rows will all end up in one group. If data arrive in a batch, that batch will be spread evenly over all groups, give or take a few rows. Whether this is significant to the quality of your calculations I cannot say.

On the matter of casting to tinyint, I do not have a mathematical answer, just an observation. Since an MD5 hash returns 16 bytes, casting to bigint already truncates it. Why would casting to tinyint be more problematic? You can also perform the modulo calculation directly on the result of HASHBYTES, which will apply an implicit cast to a 4-byte int.
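The truncation can be inspected directly. The following is a small sketch, using an arbitrary literal 'example key' as a stand-in for a key value, that casts each integer back to binary so the retained bytes can be compared with the full hash:

```sql
-- 'example key' is an arbitrary stand-in for a value from the key column.
DECLARE @h varbinary(16) = HASHBYTES('MD5', 'example key');

SELECT
    @h                                        AS full_hash,       -- all 16 bytes
    CAST(CAST(@h AS bigint)  AS varbinary(8)) AS kept_by_bigint,  -- rightmost 8 bytes
    CAST(CAST(@h AS int)     AS varbinary(4)) AS kept_by_int,     -- rightmost 4 bytes
    CAST(CAST(@h AS tinyint) AS varbinary(1)) AS kept_by_tinyint; -- rightmost byte
```

Each integer cast keeps only the trailing bytes of the hash, so the three variants differ in how many hash bytes feed the modulo rather than in kind.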

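Note also that casting the hash to a signed integer can produce a negative value, and in T-SQL the remainder takes the sign of the dividend, so the modulo can be negative as well. Taking the absolute value after calculating the modulo corrects this without skewing the groups: remainders -k and k both land in group k, and a remainder of 0 arises from negative and non-negative values alike, so every group receives the same share. Since tinyint cannot be negative, the ABS step becomes unnecessary with that cast.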
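A minimal sketch of the tinyint variant as a persisted computed column follows. The table name dbo.HashedRows and the column name group_no are hypothetical. It uses % 32 rather than the question's % 31, because % 31 only yields the 31 values 0 through 30, while 32 groups need 0 through 31.

```sql
-- Hypothetical table name, for illustration only.
CREATE TABLE dbo.HashedRows (
    keyColumnFromTable varchar(100) NOT NULL,
    -- The tinyint cast keeps the last byte of the hash (0..255);
    -- % 32 maps it onto groups 0..31, and no ABS is needed.
    group_no AS (CAST(HASHBYTES('MD5', keyColumnFromTable) AS tinyint) % 32) PERSISTED
);

INSERT INTO dbo.HashedRows (keyColumnFromTable)
VALUES ('alpha'), ('beta'), ('gamma');

-- The same key value always maps to the same group.
SELECT keyColumnFromTable, group_no
FROM dbo.HashedRows;
```

Because 256 is a multiple of 32, each group corresponds to exactly 8 of the 256 possible byte values, so the modulo itself adds no bias. With % 31 that is no longer the case, since 256 is not a multiple of 31.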

Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange