Create an aggregate checksum of a column
Question
I want to compute a checksum of all of the values of a column in aggregate.
In other words, I want to do some equivalent of
md5(group_concat(some_column))
The problem with this approach is:
- It's inefficient. It has to concat all of the values of the column as a string in some temporary storage before passing it to the md5 function
- group_concat has a max length of 1024, after which everything else will be truncated.
(In case you're wondering, you can ensure that the concat of the values is in a consistent order, however, as believe it or not group_concat() accepts an order by clause within it, e.g. group_concat(some_column order by some_column)
)
MySQL offers the nonstandard bitwise aggregate functions BIT_AND(), BIT_OR() and BIT_XOR() which I presume would be useful for this problem. The column is numeric in this case but I would be interested to know if there was a way to do it with string columns.
For this particular application, the checksum does not have to be cryptologically safe.
Solution
It seems like you might as well use crc32
instead of md5
if you don't care about cryptographic strength. I think this:
select sum(crc32(some_column)) from some_table;
would work on strings. It might be inefficient as perhaps MySQL would create a temporary table (especially if you added an order by
).
OTHER TIPS
The following query is used in Percona's Mysql Table Checksumming tool. Its a little tough to understand, but essentially it CRC32
s the column (or a bunch of columns concatted) for every row, then XOR
s them all together using the BIT_XOR
group function. If one crc hash is different, the result of XOR
ing everything will also be different. This happens in fixed memory, so you can checksum arbitrarily large tables.
SELECT CONV(BIT_XOR(CAST(CRC32(column) AS UNSIGNED)), 10, 16)
One thing to keep in mind though that this does not prevent possible collisions, and CRC32
is a pretty weak function by today's standards. A nicer hashing function would be something like the FNV_64
. It would be very unlikely to have two hashes which complement each other when XOR
ed together.
SELECT crc
FROM
(
SELECT @r := MD5(CONCAT(some_column, @r)) AS crc,
@c := @c + 1 AS cnt
FROM
(
SELECT @r := '', @c := 0
) rc,
(
SELECT some_column
FROM mytable
WHERE condition = TRUE
ORDER BY
other_column
) k
) ci
WHERE cnt = @c
If the column is numeric, you could do this:
SELECT BIT_XOR(mycolumn) + SUM(mycolumn)
Of course this is easy to defeat, but it will include all the bits in the column.