Question

I have a requirement for a very large (in terms of the number of rows, not data) database, and need to use a very long, unique, hexadecimal PRIMARY KEY instead of the usual AUTO_INCREMENT integer.

As an example, lets say that the PK is similar to an IPv6 address, say something like 2001:0db8:0000:0000:0000:8a2e:0370:7334.

So, apart from the usual searching by the PK, there will be a need to search by subfields of it - i.e. find all keys, that begin with "2001" (covered by the PK- should be sargeable), or (say) has 0db8 in their second set of four hex characters, and then to reference this 2nd key from other tables.

What is the best way to store such a primary key (assume that I am using MariaDB or MySQL)?

My thoughts are to do one of the following:

  • one big column,

  • have 8 columns and make a composite key,

  • or something else.

From a general database point of view, what would be the best technical solution to server my requirements?

Apart from the purely technical aspects of the question above, I would also be interested in any recommendations that users here might have as to the best database engine to use in this case?

I will only have a couple more tables, that will have FOREIGN KEY references to the long hex PRIMARY KEY. I can do without the DRI (Declarative Referential Integrity) if there are major performance benefits

Was it helpful?

Solution

<TL;DR>
Regular expressions can be a very powerful tool for various data storage, manipulation and indexing requirements - regexes' capabilities in conjunction with GENERATED columns can be even more flexible and powerful and provide a decent solution to the OP's issue.

In this particular case (rowing back from my initial answer!), MySQL or its derivatives are the only feasible solution. PostgreSQL doesn't allow GENERATED fields to be FOREGIN KEYs unless they're UNIQUE. Obviously, the requirement to be able to use small substrings as FKs can't be satisfied in this case, despite the fact that MySQL's regex implementation leaves a lot to be desired! PG could use TRIGGERs - but that's another question!
</TL;DR>

A relatively simple solution to your issue exists if (and maybe it's a big if), you have PRIMARY KEY string separated by the same character. A more complex solution is possible using regular expressions - both options are explored below.

You can use GENERATED(*) columns (confusingly, these are also referred to as VIRTUAL in the Wikipedia entry, which is also one of the storage classes for such fields - the other storage classes are [STORED | PERSISTENT]. From the MariaDB documentation:

  • PERSISTENT (a.k.a. STORED): This type's value [i.e. data] is actually stored in the table.

  • VIRTUAL: This type's value is not stored at all. Instead, the value [data] is generated dynamically when the table is queried. This type is the default.

(*) There are other terms for these columns, e.g. COMPUTED BY or CALCULATED.

MySQL's syntax is (versions 5.7 and up):

col_name data_type [GENERATED ALWAYS] AS (expr)
  [VIRTUAL | STORED]  -- storage class 
  [NOT NULL | NULL]
  [UNIQUE [KEY]] [[PRIMARY] KEY]
  [COMMENT 'string']

In order to use such columns as FOREIGN KEYs, their storage class has to be STORED.

"Simple" solution:

Your PK is of the format (a fiddle with the code and tests for this part is available here):

str:str:str:str:str:str:str:str (str = 1-4 characters(*) - separated by colons)


(*) Not hex characters in this case - I messed around with MySQL's HEX() and UNHEX() functions to no avail. However, see discussion!

MySQL has an IS_IPV6() function, however, the MySQL 5.7 documentation also contains this gem:

  • CHECK
    The CHECK clause is parsed but ignored by all storage engines.

So, I'm going to use MySQL 8 which has implemented them - if you want valid IP_V6 addresses in a version prior to 8, then you'll have to use a trigger. MariaDB has CHECK constraints from at least 10.3 onwards.

You mentioned that IPv6 was (only) an example - if any set of characters will do, then just remove the CHECK constraint - and see the complex solution below for a discussion on regular expressions.

I created this MySQL 8 table (see fiddle here):

CREATE TABLE test
(
  ip_v6 VARCHAR (39) NOT NULL,
  host VARCHAR (100) NOT NULL,
  
  ip_v6_token_1 VARCHAR (4) GENERATED ALWAYS AS 
    ((SUBSTRING_INDEX(SUBSTRING_INDEX(ip_v6, ':', 1), ':', -1))) STORED,
  ip_v6_token_2 VARCHAR (4) GENERATED ALWAYS AS 
    ((SUBSTRING_INDEX(SUBSTRING_INDEX(ip_v6, ':', 2), ':', -1))) STORED,
  
  INDEX (ip_v6_token_1),  -- required for FOREIGN KEYs
  INDEX (ip_v6_token_2),  --           "
  
  CONSTRAINT test_ip_v6_pk PRIMARY KEY (ip_v6),
  CONSTRAINT ip_v6_valid_ck CHECK (IS_IPV6(ip_v6))
);

-- do the same for tokens 3 - 7 as per requirements!

I inserted a few records and then:

SELECT 
  ip_v6_token_1 AS "Token 1",
  ip_v6_token_2 AS "Token 2",
  ip_v6         AS "IPv6:",
  host          AS "Host:"
FROM test ORDER BY host;

Result:

Token 1 Token 2 IPv6:                                    Host:
   2001    0db8 2001:0db8:0000:FFF0:0000:8a2e:0370:7334 host_1
     10       9 10:9:8:7:6:6:7:6                        host_2
   4001    1111 4001:1111:2222:FFF0:0000:8a2e:0370:7334 host_3

So far, so good - we have our tokens split out as required.

Now, we check the FOREIGN KEY support - create a table:

CREATE TABLE test_fk
(
  t_id INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY,

  --
  -- Other fields
  --

  token_1 VARCHAR (4) NOT NULL,
  token_2 VARCHAR (4),
  
  CONSTRAINT test_fk_test_ip_v6_token_1_fk 
    FOREIGN KEY (token_1) 
      REFERENCES test (ip_v6_token_1) ON DELETE CASCADE ON UPDATE CASCADE,
  
  CONSTRAINT test_fk_test_ip_v6_token_2_fk 
    FOREIGN KEY (token_2) 
      REFERENCES test (ip_v6_token_2) ON DELETE CASCADE ON UPDATE CASCADE
);

So, I tested the ON DELETE CASCADE and the ON UPDATE CASCADE functionality and they appear to work. Also, when one tries to INSERT on test_fk it fails if there isn't a corresponding token in the ip_v6 field of the test parent table - see the fiddle for the tests.

Complex solution (modelled on the above):

This part of the answer makes extensive use of regular expressions. I'm taking IPv6 as an exemplar of the power and flexibility of regexes - ultimately, you should be able to write your own in line with your particular requirements.

A fiddle for the code and tests can be found here.

As above, I CREATEd a table:

CREATE TABLE test
(
  ip_v6 VARCHAR (39) NOT NULL, 
  host VARCHAR (100) NOT NULL,


  ip_v6_token_1 VARCHAR(39) GENERATED ALWAYS AS 
    (SUBSTRING_INDEX(SUBSTRING_INDEX(ip_v6, ':', 1), ':', -1)),
  ip_v6_token_2 VARCHAR(39) GENERATED ALWAYS AS 
    (SUBSTRING_INDEX(SUBSTRING_INDEX(ip_v6, ':', 2), ':', -1)),

  -- as in simple solution, same for tokens 3-7 as per requirements!

  INDEX (ip_v6_token_1),  -- required for FOREIGN KEYs
  INDEX (ip_v6_token_2),


  CONSTRAINT test_ip_v6_pk PRIMARY KEY (ip_v6),
  CONSTRAINT ip_v6_regexp_ck 
    CHECK(ip_v6 REGEXP '^([[:xdigit:]]{1,4}:){7,7}[[:xdigit:]]{1,4}$')    
);

So, as you see, the change is on the CHECK constraint on the ip_v6 field:

  CONSTRAINT ip_v6_regexp_ck 
    CHECK(ip_v6 REGEXP '^([[:xdigit:]]{1,4}:){7,7}[[:xdigit:]]{1,4}$')

A brief explanation of the regular expression '^([[:xdigit:]]{1,4}:){7,7}[[:xdigit:]]{1,4}$':

  • the ^ character is an anchor - it matches the beginning of a line.
    This ensures that the format of the ip_v6 string is (from the beginning) of 1-4 hex characters (of any combination) in the ranges of 0-9 OR a-f OR A-F - from here - groups of characters enclosed in [ & ] are called character classes. My regex above is equivalent to '^([0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}$'. From here:

When it is necessary to specify the minimum and maximum number of repeats explicitly, the bounds operator {} may be used. Thus. a{2} is the letter ‘a’ repeated exactly twice, a{2,4} represents the letter ‘a’ repeated between 2 and 4 times, and a{2,} represents the letter ‘a’ repeated at least twice with no upper limit. Note that there must be no whitespace inside the {}, and there is no upper limit on the values of the lower and upper bounds.

  • the {1,4} [:xdigit:] characters are followed by a (character literal) : - it has no special meaning in regexps.

  • the enclosing parentheses (()) (important for this discussion, see below) mean that the groups of hex digits concatenated with a single : are to be treated as a group - "captured groups" - see discussion re. MySQL vs. PostgreSQL!

  • the {7,7} means 7 and no more nor no less than 7 repeats of that group can be present - since it ends with a colon, it can't be the last element. That is taken care of by the final group - [0-9a-fA-F]{1,4}$ - where the usual 1 - 4 digit hex is followed by the $ character.

  • the $ character is another anchor, this time signifying the end of the line - so the end of our ip_v6 string must be the hex character(s) without a trailing :!

So, we've ensured that our ip_v6 string at least corresponds to some of the basics about IPv6 addresses - now what can we do?

As can be seen, there are INDEXes (or KEYs - synonyms in MySQL) for the three token fields generated - so far so good. I don't have a realistic dataset to test, but if you do, you can see the effect of these indexes.

You can see that I've used SUBSTRING_INDEX (esp. from here) - which is all well and good - but this will become more and more (totally) restrictive as your requirements become more complex! For some reason, the relatively easy SUBSTRING() function is "disallowed" in GENERATED columns.

What you really need is the power of Regular expressions - these are very powerful and come in handy in lots of scenarios - like the one in this question!

You will see in the fiddle that I have run a few tests - some of which work and some of which fail - as with the simple solution!

Actual regexps for IPv6 addresses can be found here, here. These are not for the faint-hearted - first link's example:

  • (([0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,7}:|([0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,5}(:[0-9a-fA-F]{1,4}){1,2}|([0-9a-fA-F]{1,4}:){1,4}(:[0-9a-fA-F]{1,4}){1,3}|([0-9a-fA-F]{1,4}:){1,3}(:[0-9a-fA-F]{1,4}){1,4}|([0-9a-fA-F]{1,4}:){1,2}(:[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:((:[0-9a-fA-F]{1,4}){1,6})|:((:[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(:[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(ffff(:0{1,4}){0,1}:){0,1}((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9]).){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])|([0-9a-fA-F]{1,4}:){1,4}:((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9]).){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9]))

That's over 660 characters, and the second link's example is 1080 characters:

  • ^\s*((([0-9A-Fa-f]{1,4}:){7}([0-9A-Fa-f]{1,4}|:))|(([0-9A-Fa-f]{1,4}:){6}(:[0-9A-Fa-f]{1,4}|((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3})|:))|(([0-9A-Fa-f]{1,4}:){5}(((:[0-9A-Fa-f]{1,4}){1,2})|:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3})|:))|(([0-9A-Fa-f]{1,4}:){4}(((:[0-9A-Fa-f]{1,4}){1,3})|((:[0-9A-Fa-f]{1,4})?:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:))|(([0-9A-Fa-f]{1,4}:){3}(((:[0-9A-Fa-f]{1,4}){1,4})|((:[0-9A-Fa-f]{1,4}){0,2}:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:))|(([0-9A-Fa-f]{1,4}:){2}(((:[0-9A-Fa-f]{1,4}){1,5})|((:[0-9A-Fa-f]{1,4}){0,3}:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:))|(([0-9A-Fa-f]{1,4}:){1}(((:[0-9A-Fa-f]{1,4}){1,6})|((:[0-9A-Fa-f]{1,4}){0,4}:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:))|(:(((:[0-9A-Fa-f]{1,4}){1,7})|((:[0-9A-Fa-f]{1,4}){0,5}:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:)))(%.+)?\s*$

However, the IS_IPV6() function removes the need to use horrors like these!

Whether or which is correct/better is a bit above my pay-grade, but you can see from the discussion under the first link's answer (> 250 upvotes!) that it's a complex task to get right but also that regular expressions can be very powerful!

However, all is not rosy in the garden - if you look at my complex solution fiddle, I've added some regular expression tests and MySQL fails on a few - notably those using captured groups (that is, bracketing using ( and ).

An example is this one (see the MySQL 8 complex fiddle):

It SHOULD work like this (see the last code snippet in the PostgreSQL fiddle here):

SELECT
  REGEXP_REPLACE
  (
    ip_v6, '(^[[:xdigit:]]{1,4}):([[:xdigit:]]{1,4}):([[:xdigit:]]{1,4}):(.*$)', '\4 xx \3   \2   \1'
  ) 
  AS "Rep text 2",
  ip_v6
FROM test;

Result:

Rep text 2                                      ip_v6
FFF0:0000:8a2e:0370:7334 xx 0000   0db8   2001  2001:0db8:0000:FFF0:0000:8a2e:0370:7334
123:123:123:123:123      xx  321    654    987  987:654:321:123:123:123:123:123

Notice the 5 last tokens of the ip_v6 string followed by xx followed by tokens 3, 2 & 1 in that order - i.e. reversed as per the captured groups (\3 &c.) in the (brackets).

MySQL returns (fiddle - last snippet in the fiddle):

Rep text 2      ip_v6
4 xx 3   2   1  123:123:123:123:123:123:123:123
4 xx 3   2   1  2001:0db8:0000:FFF0:0000:8a2e:0370:7334

MySQL simply treats the replacements for the captured groups (i.e. what would be the tokens) as string literals (4, 3, 2, 1) - this is quite simply appalling - instead of failing, it returns the WRONG DATA. MySQL does the same thing - incorrect data instead of failing for character class shortcuts - \d fails as a replacement for [[:digit:]] which works for PostgreSQL. Again, instead of failing, it returns an incorrect result!

In this PostgreSQL fiddle, I set up the GENERATED columns and my FOREIGN KEYs as per the MySQL complex solution fiddle - however, with PostgreSQL, there is a total showstopper for the OP's requirements.

There has to be a UNIQUE index (not just a simple one) on that column for it to be able to be used as an FK. Just for the sake of completeness, I did this, but it completely puts the kibosh on using PostgreSQL for this question's requirements - but hey, I learnt a lot going through all of this - I hope you have too!

OTHER TIPS

The data type you choose to store this key isn't going to make much of a difference for your performance. Rather, you should store the pieces of the key that you'll need to search on, in their own columns as well that way you can avoid wildcard searches (contains, ends-with, etc) which are inefficient and do exact searches instead. You can achieve this either from your application that's generating / managing the key or you can use a generated persistent column which you can read about here. Either way, by breaking out the pieces of the key you'll end up doing exact match searches on allows you to index those pieces efficiently.

It may be worth considering FULLTEXT index in MySQL. For example, these table definitions are allowed in MySQL 5.7:

CREATE TABLE `nested_obj` (
  `id` varchar(1000) NOT NULL,
  `val` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`),
  FULLTEXT KEY `id` (`id`)
);

CREATE TABLE `event` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `obj_id` varchar(1000) DEFAULT NULL,
  `type` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY (`obj_id`),
  FULLTEXT KEY `fulltext_obj_id` (`obj_id`),
  CONSTRAINT FOREIGN KEY (`obj_id`) REFERENCES `nested_obj` (`id`)
);

Then you can do non-joined searches in event table using match(obj_id) against('word') predicates, it understands colon as word delimiter. You may additionally use regexp predicate to set the position of a searched word, like:

select obj_id, count(*) from event
where match(obj_id) against('sixfour')
and match(obj_id) against('fiveeight')
and obj_id regexp '^([^:]+:){1}sixfour'
group by 1;
+--------------------------------------------------------+----------+
| obj_id                                                 | count(*) |
+--------------------------------------------------------+----------+
| ninesevennine:sixfour:fiveeight:fourfour               |        2 |
| threeeight:sixfour:twonine:fiveeight                   |        5 |
| threefivefive:sixfour:fiveeight:sevenfivesix           |        1 |
| zerofournine:sixfour:sevenzerotwo:sevenseven:fiveeight |        8 |
+--------------------------------------------------------+----------+

select obj_id, count(*) from event
where match(obj_id) against('sixfour')
and match(obj_id) against('fiveeight')
and obj_id regexp '^([^:]+:){2}sixfour'
group by 1;
+-----------------------------------------------------+----------+
| obj_id                                              | count(*) |
+-----------------------------------------------------+----------+
| fiveeight:oneone:sixfour:fourfiveeight:sixzero      |        6 |
| fiveeight:threefiveseven:sixfour:zerozerofive       |        7 |
| zerosixthree:sevensix:sixfour:fiveeight:onezerofive |        4 |
+-----------------------------------------------------+----------+

But the performance benefits of this solution are unclear. I guess it will highly depend on the particular distributions of values in the dataset.

Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange
scroll top