Вопрос

With my webapp I'm storing cached files with a hash generated filename in various subdirectories to optimize performance levels. One way I know that I could improve performance also would be to ensure that the generated names follow a 8.3 filename structure so NTFS doesn't have to generate short filenames (I won't be able to set that in the registry).

In order to do that though I would have to trim the hash (I was thinking SHA1) to 8 characters, obviously this will increase the probability of collision substantially. What I would like to know is what is the probability of collision?

I've seen the answer here on full SHA1 hash collision rates but my maths is terrible so calculating the value is way beyond me.

Это было полезно?

Решение

Since SHA-1's output is uniformly distributed, you can approximate the collision rate using the Birthday Paradox:

Assume you keep n bits of the SHA-1 output, there is a ~50% chance that you would have a collision in a set containing 2^(n/2) records, or in other words your collision rate is approximately 1/2^(n/2)

If you need a more accurate answer, you can always use the formula in the answer you've referenced in your question.

So here, if we assume each character is 1 Byte (8 bits), then you will most likely encounter a collision if you have ~2^(8*8/2) = 4294967296 records (therefore the collision rate is going to be 2.32 * 10^-8 which is very small).

Considering the collision rate you have discovered using your test program, the ToSHA1Fingerprint() function returns a Hexadecimal string which means an 8 character sub-string of it only represents 4 bytes and hence the approximate collision rate based on the above formula is 1/2^(4*8/2) = 0.000015258789 or 0.002%.

Другие советы

It looks like the collision rate is too high for my needs anyway, I'm getting ~0.004% testing using the following code.

const int Iterations = 10;
const int Maxitems = 360000;

for (int i = 0; i < Iterations; i++)
{
    List<string> paths = new List<string>();

    for (int j = 0; j < Maxitems; j++)
    {
        string path = Path.GetRandomFileName().ToSHA1Fingerprint()
                                              .Substring(0, 8);

        paths.Add(path);
    }

    int count = paths.Distinct().Count();

    double collisionRate = ((Maxitems - count) * 100D) / Maxitems;
    collisions.Add(collisionRate);
}

double averageCollisionRate = collisions.Average();
Лицензировано под: CC-BY-SA с атрибуция
Не связан с StackOverflow
scroll top