When using SHA-256 hashes as a primary key, is it OK to ignore the possibility of collisions? [duplicate]

https://stackoverflow.com/questions/23639952

21-07-2023
|

Frage

I have this situation where I have files on the HDD and I want to cache information about them in a database. Information that would otherwise take a long time to parse given that some of these files can run into GBs.

My first intuition is to use the file path as a unique identifier for a file and I use that as the key (TEXT/VARCHAR) and store the information as value in a database table.

Given that under some file systems (especially in *nix), file paths can be of unlimited length. It seems like a bad idea to use file name as a primary key in a database. Just indexing on a string field is going to be much slower, not to mention memory/space constraints.

I thought, maybe, I generate SHA-256 hash from the full file path (/usr/xxx/1/2/../abc.xyz) and use this as primary key (fixed width) in my database. Another thought, would be to generate the SHA-256 hash from file contents. However, this can also become quite time consuming.

My question is - in this scenario, are hash collisions as equally less likely, as the answer provided on this excellent thread.

Lösung

To answer your question, you would only be likely have issues with hash collisions if you were to approach 2^128 files in your table. This assumes that all inputs are between 0 .. +INF in length and that the hash algorithm you are using is perfect (SHA-256 is considered perfect in practice but not proven in theory) and that the output size is exactly 256 bits.

If you have under a few billion files, you should be fine.

Now for my recommendation. I would say that you need to tell us more information about your intended use. Your first thought is closer to correct than your hashing approach.

I would use a table like this (T-SQL Syntax for SQL Server):

CREATE TABLE [File]
(
    [Id] BIGINT IDENTITY NOT NULL,
    [Path] CHARACTER VARYING(MAX) NOT NULL

    PRIMARY KEY([Id])
);

CREATE NONCLUSTERED INDEX [File_Path_IX] ON [File]([Path]);

Then, you should let your database take care of indexing and making the searches fast. If and only if you experience a major performance issue later down the road, demonstrated by profiling, should you consider changing to a hashing approach. The hashing will impose massive computational penalty on your preprocessing and will introduce complicated scenarios such as hash collisions and trying to resolve them if and when they do happen.

Lizenziert unter: CC-BY-SA mit Zuschreibung

Nicht verbunden mit StackOverflow