When is CRC more appropriate to use than MD5/SHA1?

https://stackoverflow.com/questions/996843

13-09-2019
|

Question

When is it appropriate to use CRC for error detection versus more modern hashing functions such as MD5 or SHA1? Is the former easier to implement on embedded hardware?

Solution

CRC works fine for detecting random errors in data that might occur, for example, from network interference, line noise, distortion, etc.

CRC is computationally much less complex than MD5 or SHA1. Using a hash function like MD5 is probably overkill for random error detection. However, using CRC for any kind of security check would be much less secure than a more complex hashing function such as MD5.

And yes, CRC is much easier to implement on embedded hardware, you can even get different packaged solutions for this on IC.

OTHER TIPS

CRC is designed against unintentional changes in the data. That is, it's good for detecting unintentional errors, but will be useless as a way of making sure a data was not maliciously handled.

Also see this.

I found a study that shows how inappropriate CRC hashes are for hash tables. It also explains the actual characteristics of the algorithm. The study also includes evaluation of other hash algorithms and is a good reference to keep.

The relevant conclusion on CRC for hashes:

CRC32 was never intended for hash table use. There is really no good reason to use it for this purpose, and I recommend that you avoid doing so. If you decide to use CRC32, it's critical that you use the hash bits from the end opposite to that in which the key octets are fed in. Which end this is depends on the specific CRC32 implementation. Do not treat CRC32 as a "black box" hash function, and do not use it as a general purpose hash. Be sure to test each application of it for suitability.

UPDATE

It seems the site is down. The internet archive has a copy though.

I ran every line of this PHP code in 1.000.000 loop. Results are in comments (#).

hash('crc32', 'The quick brown fox jumped over the lazy dog.');#  750ms   8 chars
hash('crc32b','The quick brown fox jumped over the lazy dog.');#  700ms   8 chars
hash('md5',   'The quick brown fox jumped over the lazy dog.');#  770ms  32 chars
hash('sha1',  'The quick brown fox jumped over the lazy dog.');#  880ms  40 chars
hash('sha256','The quick brown fox jumped over the lazy dog.');# 1490ms  64 chars
hash('sha384','The quick brown fox jumped over the lazy dog.');# 1830ms  96 chars
hash('sha512','The quick brown fox jumped over the lazy dog.');# 1870ms 128 chars

My conclusion:

Use "crc32b" when you need http://en.wikipedia.org/wiki/Cyclic_redundancy_check and you do not care about security.
Use "sha256" (or higher) when you need added security layer.
Do not use "md5" or "sha1" because they have:
1. some security issues when you care about security
2. longer hash string and are slower than "crc32b" when all you need is CRC

For CRC information on implementation, speed and reliability see A painless guide to CRC error detection algorithms. It has everything on CRCs.

Unless somebody is going to try and modify your data maliciously and hide the change CRC is sufficient. Just use a "Good" (standard) polinomial.

You do not say what it is that you are trying to protect.

A CRC is often used in embedded systems as a check against accidental data corruption as opposed to preventing malicious system modification. Examples of the places where a CRC can be useful is to validate an EPROM image during system initialisation to guard against firmware corruption. The system bootloader will calculate the CRC for the application code and compare with the stored value before allowing the code to run. This protects against the possibility of accidental program corruption or a failed download.

A CRC can also be used in a similar manner to protect configuration data stored in FLASH or EEPROM. If the CRC is incorrect then the data can be flagged as invalid and a default or backup data set used. The CRC may be invalid due to device failure or if the user removed power during an update of the configuration data store.

There have been comments that a hash provides greater probability of detecting corruption than a CRC with multiple bit errors. This is true, and the decision on whether or not to use a 16 or 32 bit CRC will hinge upon the safety consequences of a corrupted data block being used and whether you can justify the 1 in 2^16 or 2^32 chance of a data block being incorrectly declared valid.

Many devices have a built in CRC generator for standard algorithms. The MSP430F5X series from Texas have a hardware implementation of the CRC-CCITT Standard.

CRC32 is faster and the hash is only 32bits long.

Use it when you just want a quick and light checksum. CRC is used in ethernet.

If you need more reliability it's preferable to use a modern hashing function.

It all depends on your requirements and expectation.

Here are quick brief differences between these hash function algorithms:

CRC (CRC-8/16/32/64)

is not a cryptographic hashing algorithm (it's using a linear function based on cyclic redundancy checks)
can produce either 9, 17, 33 or 65 bits
not intended to be used for cryptographic purposes since makes no cryptographic guarantees,
unsuitable for use in digital signatures, because it's easily reversible²⁰⁰⁶,
should not be used for encryption purposes,
different strings can generate the collision,
invented in 1961 and used in Ethernet and many other standards,

MD5

is a cryptographic hash algorithm,
producing a 128-bit (16-byte) hash value (32 digit hexadecimal numbers)
it is a cryptographic hash, but is considered deprecated if you worry about security,
there are known strings which have the same MD5 hash value
can be used for encryption purposes,

SHA-1

is a cryptographic hash algorithm,
produces a 160-bit (20-byte) hash value known as a message digest
it is a cryptographic hash and since 2005 it's no longer considered secure,
can be used for encryption purposes,
an example of a sha1 collision has been found
first published in 1993 (as SHA-0), then 1995 as SHA-1,
series: SHA-0, SHA-1, SHA-2, SHA-3,

In summary, using SHA-1 is no longer considered secure against well-funded opponents, because in 2005, cryptanalysts found attacks on SHA-1 which suggests it may be not secure enough for ongoing use^schneier. U.S. NIST advise that federal agencies should stop using SHA1-1 for application which require collision resistance and must use SHA-2 after 2010^NIST.

Therefore, if you're looking for simple and quick solution for checking the integrity of a files (against the corruption), or for some simple caching purposes in terms of performance, you can consider CRC-32, for hashing you may consider to use MD5, however if you're developing professional application (which should be secure and consistent), to avoid any collision probabilities - use SHA-2 and above (such as SHA-3).

Performance

Some simple benchmark test in PHP:

# Testing static text.

$ time php -r 'for ($i=0;$i<1000000;$i++) crc32("foo");'
real    0m0.845s
user    0m0.830s
sys     0m0.008s

$ time php -r 'for ($i=0;$i<1000000;$i++) md5("foo");'
real    0m1.103s
user    0m1.089s
sys     0m0.009s

$ time php -r 'for ($i=0;$i<1000000;$i++) sha1("foo");'
real    0m1.132s
user    0m1.116s
sys   0m0.010s

# Testing random number. 

$ time php -r 'for ($i=0;$i<1000000;$i++) crc32(rand(0,$i));'
real    0m1.754s
user    0m1.735s
sys     0m0.012s\

$ time php -r 'for ($i=0;$i<1000000;$i++) md5(rand(0,$i));'
real    0m2.065s
user    0m2.042s
sys     0m0.015s

$ time php -r 'for ($i=0;$i<1000000;$i++) sha1(rand(0,$i));'
real    0m2.050s
user    0m2.021s
sys     0m0.015s

What’s the difference between md5(), crc32() and sha1() crypto on PHP?

Only use CRC if computation resources are very tight (i.e. some embed environments) or you need to store/transport many output values and space/bandwidth is tight (as CRCs are usually 32-bit where an MD5 output is 128-bit, SHA1 160 bit, and other SHA variants up to 512 bit).

Never use CRC for security checks as a CRC is very easy to "fake".

Even for accidental error detection (rather than malicious change detection) hashes are better than a simple CRC. Partly because of the simple way a CRC is calculated (and partly because CRC values are usual shorter than common hash outputs so have a much smaller range of possible values) it is much more likely that, in a situation where there are two or more errors, one error will mask another so you end up with the same CRC despite two errors.

In short: unless you have reason not to use a decent hash algorithm, avoid simple CRCs.

I came across a use of CRC recently which was smart. The author of the jdupe file duplication identification and removal tool (the same author of the popular exif tool jhead) uses it during the first pass through the files. A CRC is computed on the first 32K of each file to mark files that appear to be the same, also the files must have the same size. These files are added to a list of files on which to do a full binary comparison. It speeds up checking large media files.

CRC32 is way faster and sometimes has hardware support (i.e. on Nehalem processors). Really, the only time you'd use it is if you're interfacing with hardware, or if you're really tight on performance

Lets start with the basics.

In Cryptography, a hashing algorithm converts many bits to fewer bits through a digest operation. Hashes are used to confirm integrity of messages and files.

All hashing algorithms generate collisions. A collision is when several many-bit combinations produce the same fewer bit output. The cryptographic strength of a hashing algorithm is defined by the inability for an individual to determine what the output is going to be for a given input because if they could they could construct a file with a hash that matches a legitimate file and compromise the assumed integrity of the system. The difference between CRC32 and MD5 is that MD5 generates a larger hash that's harder to predict.

When you want to implement message integrity - meaning the message hasn't been tampered with in transit - the inability to predict collisions is an important property. A 32-bit hash can describe 4 billion different messages or files using 4 billion different unique hashes. If you have 4 billion and 1 files, you are guaranteed to have 1 collision. 1 TB Bitspace has the possibility for Billions of Collisions. If I'm an attacker and I can predict what that 32 bit hash is going to be, I can construct an infected file that collides with the target file; that has the same hash.

Additionally if I'm doing 10mbps transmission then the possibility of a packet getting corrupted just right to bypass crc32 and continue along the to the destination and execute is very low. Lets say at 10mbps I get 10 errors\second. If I ramp that up to 1gbps, now I'm getting 1,000 errors per second. If I ram up to 1 exabit per second, then I have an error rate of 1,000,000,000 errors per second. Say we have a collision rate of 1\1,000,000 transmission errors, Meaning 1 in a million transmission errors results in the corrupt data getting through undetected. At 10mbps I'd get error data being sent every 100,000 seconds or about once a day. At 1gbps it'd happen once every 5 minutes. At 1 exabit per second, we're talking several times a second.

If you pop open Wireshark you'll see your typical Ethernet header has a CRC32, your IP header has a CRC32, and your TCP Header has a CRC32, and that's in addition to the what the higher layer protocols may do; e.g. IPSEC might use MD5 or SHA for integrity checking in addition to the above. There are several layers of error checking in typical network communications, and they STILL goof now and again at sub 10mbps speeds.

Cyclic Redundancy Check (CRC) has several common versions and several uncommon but generally is designed to just tell when a message or file has been damaged in transit (multiple bits flipping). CRC32 by itself is not a very good error checking protocol by today's standards in large, scalar enterprise environments because of the collision rate; the average users hard-drive can have upwards of 100k files, and file-shares on a company can have tens of millions. The ratio of hash-space to the number of files is just too low. CRC32 is computationally cheap to implement whereas MD5 isn't.

MD5 was designed to stop intentional use of collisions to make a malicious file look benign. It's considered insecure because the hashspace has been sufficiently mapped to enable some attacks to occur, and some collisions are predictable. SHA1 and SHA2 are the new kids on the block.

For file verification, Md5 is starting to be used by a lot of vendors because you can do multigigabyte files or multiterrabyte files quickly with it and stack that on top of the general OS's use and support of CRC32's. Do not be surprised if within the next decade filesystems start using MD5 for error checking.

CRC code is simpler and faster.

For what do you need any?

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow