Question

I have a table with the structure (content_md5 UUID, content TEXT), where I'd like to use content_md5 (which is the value of md5(content)) as the primary key, and use it as a foreign key in other tables.

This is a "static" table, where the content (some largish documents) would be referred to by their md5 value for simplicity, and to prevent duplication in the table (which wouldn't be a given with a simple SERIAL PKEY).

However, content can be NULL (as distinct from an empty string), to indicate that a row in the referencing table has no content.
Since md5(NULL) returns NULL, and NULL is not allowed in a primary key constraint, I'd like to have a way of having md5(NULL) return all zeros instead of NULL.

Example:

-- setup
CREATE TABLE example (content_md5 UUID PRIMARY KEY, content TEXT);
CREATE TABLE test (id SERIAL PRIMARY KEY, tags TEXT, content_md5 UUID REFERENCES example(content_md5) ON DELETE RESTRICT);
INSERT INTO  example VALUES ('00000000000000000000000000000000'::uuid, NULL);

-- usage
INSERT INTO example VALUES (md5('some text')::uuid, 'some text');
INSERT INTO test (tags, content_md5) VALUES ('some content defining tags', md5('some text')::uuid);

SELECT tags, content FROM test LEFT JOIN example USING (content_md5);

-- QUESTION: Having an md5-like function to return zero-filled "md5"/uuid?
INSERT INTO example VALUES (md5(NULL)::uuid, NULL); -- ignored, because already existing record
INSERT INTO test (tags, content_md5) VALUES ('non-existing-document', md5(NULL)::uuid);

Is it possible to somehow cast the returned value to a zero-filled string, create a custom function based on md5() which replaces NULL with 00000000000000000000000000000000, or some other way to achieve this result?

/edit: Or perhaps I don't need any NULL values in this table, and can just set the referencing foreign key column to NULL to achieve the same result?

Solution

I suggest this alternative design:

-- setup
CREATE TABLE example (content_id serial PRIMARY KEY, content text);
CREATE TABLE test (id serial PRIMARY KEY, tags TEXT, content_id int REFERENCES example);

CREATE UNIQUE INDEX ON example ((md5(content)::uuid)) INCLUDE (content_id); -- !

-- usage
INSERT INTO example(content) VALUES (NULL);        -- allowed multiple times
INSERT INTO example(content) VALUES ('some text');

INSERT INTO test (tags, content_id)
SELECT 'some content defining tags', content_id
FROM   example
WHERE  md5(content)::uuid = md5('some text')::uuid;

db<>fiddle here

Major points

Use a serial column (content_id) as surrogate PK of table example - and as FK reference everywhere. 4 bytes instead of 16.

Enforce uniqueness with a unique index on the expression md5(content)::uuid. Be aware that hash collisions are possible (even if very unlikely while your table isn't huge).

While you are at it, add the serial PK column to the index with an INCLUDE clause (Postgres 11 or later) to make it a covering index for fast index-only lookups.

As opposed to a PK column, the unique index allows NULL, and NULL is not considered to be a duplicate of NULL, which should cover your use case.

In Postgres 10 or older don't add content_id to the index. Then you don't get index-only scans, of course:

CREATE UNIQUE INDEX ON example ((md5(content)::uuid));

If you want to allow only a single instance of NULL, you can enforce that with a function like the one you posted (introducing the risk of a collision, even if unlikely) or with a tiny partial index in addition to the one above. The indexed expression has to be the same for every NULL row; indexing content_id would not restrict anything, since content_id is already unique:

CREATE UNIQUE INDEX ON example ((content IS NULL))
WHERE content IS NULL;

Do not store the md5 value as table column (redundantly) at all.
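
With the unique expression index in place, a deduplicating insert could look roughly like the sketch below. It assumes the tables and index from the setup above and Postgres 9.5 or later for ON CONFLICT; the extra comparison on content is an added safeguard against the hash collisions mentioned earlier, not part of the original setup.

```sql
-- Insert the document only if no row with the same md5 exists yet.
-- The conflict target repeats the index expression so Postgres can infer the index.
INSERT INTO example (content)
VALUES ('some text')
ON CONFLICT ((md5(content)::uuid)) DO NOTHING;

-- Fetch the surrogate key by hash, then compare the content itself
-- to rule out a different document with the same md5.
SELECT content_id
FROM   example
WHERE  md5(content)::uuid = md5('some text')::uuid
AND    content = 'some text';
```

Note that DO NOTHING would also silently skip a colliding document with different content, so the second query returning no row is the signal that something unusual happened.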


If you want to keep using the function you posted in your answer, consider optimizing it:

CREATE OR REPLACE FUNCTION pg_temp.md5zero(data text)
  RETURNS uuid PARALLEL SAFE IMMUTABLE LANGUAGE sql AS
$func$
SELECT COALESCE(md5(data)::uuid, '00000000000000000000000000000000')
$func$;

Faster than the PL/pgSQL version, and it can be inlined.
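
A quick check of the function, assuming it was created in the current session (pg_temp functions are temporary and must be called schema-qualified):

```sql
-- NULL input maps to the all-zero UUID; any other input is hashed as usual.
SELECT pg_temp.md5zero(NULL)        AS null_hash
     , pg_temp.md5zero('some text') AS text_hash
     , pg_temp.md5zero('some text') = md5('some text')::uuid AS same_as_md5;
```

Here null_hash should come back as 00000000-0000-0000-0000-000000000000 and same_as_md5 as true.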

OTHER TIPS

Your request sort of breaks the concept of primary keys--you want your primary key to be dependent upon another column (why not make that other column--content in your case--the primary key?), and yet at the same time you want that derivative column to be unique. It's possible to have this setup, but the design lends itself to confusion (i.e., future DBAs/developers will need to try to decipher what your design decisions were).

Also, md5() doesn't return a UUID type (though I suppose you intend to cast into UUID).

That said, I think you can use COALESCE() along with a sequence:

edb=# create sequence abc_seq;
CREATE SEQUENCE
edb=# create table abc (content_md5 text primary key, content text);
CREATE TABLE
edb=# insert into abc values (md5(coalesce('mycontent',nextval('abc_seq')::text)),'mycontent');
INSERT 0 1
edb=# insert into abc values (md5(coalesce(null,nextval('abc_seq')::text)),null);
INSERT 0 1
edb=# select * from abc;
           content_md5            |  content  
----------------------------------+-----------
 c8afdb36c52cf4727836669019e69222 | mycontent
 c4ca4238a0b923820dcc509a6f75849b | 
(2 rows)

Please also be aware that you can't set a DEFAULT on content_md5 because of the following:

edb=# create table abc (content_md5 text primary key default md5(coalesce(content,nextval('abc_seq')::text)), content text);
ERROR:  cannot use column references in default expression
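
One possible workaround, sketched here as an assumption rather than something from the original answer, is a BEFORE INSERT trigger that fills in content_md5. The function and trigger name abc_set_md5 is made up for illustration; abc and abc_seq are the table and sequence from the session above.

```sql
-- Hypothetical trigger function: computes the key from the row being inserted.
CREATE FUNCTION abc_set_md5() RETURNS trigger LANGUAGE plpgsql AS
$$
BEGIN
   -- Same expression as in the manual inserts above.
   NEW.content_md5 := md5(coalesce(NEW.content, nextval('abc_seq')::text));
   RETURN NEW;
END
$$;

CREATE TRIGGER abc_set_md5
BEFORE INSERT ON abc
FOR EACH ROW EXECUTE PROCEDURE abc_set_md5();

-- The key column can now be omitted.
INSERT INTO abc (content) VALUES ('other content');
INSERT INTO abc (content) VALUES (NULL);
```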

I've found a way to create a custom function which does what I want.
I'm not sure this is the best way to solve it, but it works for me, so here goes:

CREATE OR REPLACE FUNCTION md5zero(data text) RETURNS text AS $$
BEGIN
    IF data IS NULL
    THEN
        RETURN '00000000000000000000000000000000';
    ELSE
        RETURN md5(data);
    END IF;
END;
$$ LANGUAGE plpgsql;

-- TEST: 
INSERT INTO example VALUES (md5zero(NULL)::uuid, NULL); -- ignored, because already existing record

-- ERROR:  duplicate key value violates unique constraint "example_pkey"
-- DETAIL:  Key (content_md5)=(00000000-0000-0000-0000-000000000000) already exists.
-- Time: 0.477 ms

INSERT INTO test (tags, content_md5) VALUES ('non-existing-document', md5zero(NULL)::uuid);

-- id |         tags          |             content_md5              
------+-----------------------+--------------------------------------
--  4 | non-existing-document | 00000000-0000-0000-0000-000000000000
Licensed under: CC-BY-SA with attribution