Question

I want to compute a checksum of all of the values of a column in aggregate.

In other words, I want to do some equivalent of

md5(group_concat(some_column))

The problem with this approach is:

  1. It's inefficient. It has to concat all of the values of the column as a string in some temporary storage before passing it to the md5 function
  2. group_concat has a max length of 1024, after which everything else will be truncated.

(In case you're wondering, you can ensure that the concat of the values is in a consistent order, however, as believe it or not group_concat() accepts an order by clause within it, e.g. group_concat(some_column order by some_column))

MySQL offers the nonstandard bitwise aggregate functions BIT_AND(), BIT_OR() and BIT_XOR() which I presume would be useful for this problem. The column is numeric in this case but I would be interested to know if there was a way to do it with string columns.

For this particular application, the checksum does not have to be cryptologically safe.

Was it helpful?

Solution

It seems like you might as well use crc32 instead of md5 if you don't care about cryptographic strength. I think this:

select sum(crc32(some_column)) from some_table;

would work on strings. It might be inefficient as perhaps MySQL would create a temporary table (especially if you added an order by).

OTHER TIPS

The following query is used in Percona's Mysql Table Checksumming tool. Its a little tough to understand, but essentially it CRC32s the column (or a bunch of columns concatted) for every row, then XORs them all together using the BIT_XOR group function. If one crc hash is different, the result of XORing everything will also be different. This happens in fixed memory, so you can checksum arbitrarily large tables.

SELECT CONV(BIT_XOR(CAST(CRC32(column) AS UNSIGNED)), 10, 16)

One thing to keep in mind though that this does not prevent possible collisions, and CRC32 is a pretty weak function by today's standards. A nicer hashing function would be something like the FNV_64. It would be very unlikely to have two hashes which complement each other when XORed together.

SELECT  crc
FROM
(
  SELECT @r := MD5(CONCAT(some_column, @r)) AS crc,
         @c := @c + 1 AS cnt
  FROM
    (
    SELECT @r := '', @c := 0
    ) rc,
    (
    SELECT some_column
    FROM mytable
    WHERE condition = TRUE
    ORDER BY
      other_column
    ) k
) ci
WHERE cnt = @c

If the column is numeric, you could do this:

SELECT BIT_XOR(mycolumn) + SUM(mycolumn)

Of course this is easy to defeat, but it will include all the bits in the column.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top