Why would you store an enum in DB?

https://softwareengineering.stackexchange.com/questions/305148

10-12-2020

Question

I've seen a number of questions, like this, asking for advice on how to store enums in DB. But I wonder why you would do that. So let's say that I have an entity Person with a gender field, and a Gender enum. Then, my person table has a column gender.

Besides the obvious reason of enforcing correctness, I don't see why I would create an extra table gender to map what I already have in my application. And I don't really like having that duplication.

Solution

Let's take another example that is less fraught with conceptions and expectations. I've got an enum here, and it is the set of priorities for a bug.

What value are you storing in the database?

So, I could be storing 'C', 'H', 'M', and 'L' in the database. Or 'HIGH' and so on. This has the problem of stringly-typed data. There's a known set of valid values, and if you aren't storing that set in the database, it can be difficult to work with.

Why are you storing the data in the code?

You've got List<String> priorities = List.of("CRITICAL", "HIGH", "MEDIUM", "LOW"); or something to that effect in the code. It means that you've got various mappings of this data to the proper format (you're inserting all caps into the database, but you're displaying it as Critical). Your code is now also difficult to localize. You have bound the database representation of the idea to a string that is stored in the code.

Anywhere you need to access this list, you either need to have code duplication or a class with a bunch of constants. Neither of these is a good option. One should also not forget that there are other applications that may use this data, and they may be written in other languages (for example, the Java web application has a Crystal Reports reporting system attached and a Perl batch job feeding data into it). The reporting engine would need to know the valid list of data (what happens if there's nothing marked in 'LOW' priority and you need to know that that is a valid priority for the report?), and the batch job would need the same information about what the valid values are.

Hypothetically, you might say "we're a single-language shop - everything is written in Java" and have a single .jar that contains this information - but now it means that your applications are tightly coupled to each other and that .jar containing the data. You'll need to release the reporting part and the batch update part along with the web application each time there is a change - and hope that that release goes smoothly for all parts.

What happens when your boss wants another priority?

Your boss came by today. There's a new priority - CEO. Now you have to go and change all the code and do a recompile and redeploy.

With an 'enum-in-the-table' approach, you update the enum list to have a new priority. All the code that gets the list pulls it from the database.
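As a rough sketch of what that can look like, here is a small example using Python's built-in sqlite3 module and an in-memory database. The table name priority, its columns, the sort_order values and the helper names make_db and load_priorities are all assumptions made up for illustration, not something from a real schema.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database holding a hypothetical priority table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE priority ("
        " code TEXT PRIMARY KEY,"          # value stored in other tables
        " label TEXT NOT NULL,"            # value shown to users
        " sort_order INTEGER NOT NULL)"    # display order (illustrative)
    )
    conn.executemany(
        "INSERT INTO priority (code, label, sort_order) VALUES (?, ?, ?)",
        [("CRITICAL", "Critical", 10), ("HIGH", "High", 20),
         ("MEDIUM", "Medium", 30), ("LOW", "Low", 40)],
    )
    return conn


def load_priorities(conn: sqlite3.Connection) -> list[str]:
    """Return display labels in order; no specific priority is named in code."""
    rows = conn.execute("SELECT label FROM priority ORDER BY sort_order")
    return [label for (label,) in rows]


conn = make_db()
print(load_priorities(conn))  # ['Critical', 'High', 'Medium', 'Low']

# The boss's new priority is a data change, not a code change.
conn.execute("INSERT INTO priority (code, label, sort_order) VALUES ('CEO', 'CEO', 5)")
print(load_priorities(conn))  # ['CEO', 'Critical', 'High', 'Medium', 'Low']
```

Nothing in load_priorities had to change when the CEO row was added, which is the point of the approach.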

Data rarely stands alone

With priorities, the data keys into other tables that might contain information about workflows, or who can set this priority or whatnot.

Going back to the gender as mentioned in the question for a bit: Gender has a link to the pronouns in use: he/his/him and she/hers/her... and you want to avoid hard coding that into the code itself. And then your boss comes by and you need to add an 'OTHER' gender (to keep it simple) and you need to relate this gender to they/their/them... and your boss sees what Facebook has and... well, yeah.

By restricting yourself to a stringly-typed bit of data rather than an enum table, you've now needed to replicate that string in a bunch of other tables to maintain this relationship between the data and its other bits.
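To make the relationship idea concrete, here is a minimal sketch (Python with the built-in sqlite3 module) in which the pronouns hang off the gender row instead of being hard coded. The schema, the column names subject/possessive/object and the helper pronouns_for are hypothetical; a real design might well put pronouns in a separate table.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Build a hypothetical gender lookup table that carries related data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE gender (
            gender_id  INTEGER PRIMARY KEY,
            label      TEXT NOT NULL UNIQUE,
            subject    TEXT NOT NULL,
            possessive TEXT NOT NULL,
            object     TEXT NOT NULL
        );
        CREATE TABLE person (
            person_id INTEGER PRIMARY KEY,
            name      TEXT NOT NULL,
            gender_id INTEGER NOT NULL REFERENCES gender (gender_id)
        );
        INSERT INTO gender VALUES (1, 'MALE',   'he',  'his',  'him'),
                                  (2, 'FEMALE', 'she', 'hers', 'her');
    """)
    return conn


def pronouns_for(conn: sqlite3.Connection, name: str):
    """Look up a person's pronouns through the gender key, not through code."""
    return conn.execute(
        "SELECT g.subject, g.possessive, g.object"
        " FROM person p JOIN gender g ON g.gender_id = p.gender_id"
        " WHERE p.name = ?",
        (name,),
    ).fetchone()


conn = make_db()
# Adding 'OTHER' and its pronouns is one row of data.
conn.execute("INSERT INTO gender VALUES (3, 'OTHER', 'they', 'their', 'them')")
conn.execute("INSERT INTO person (name, gender_id) VALUES ('Sam', 3)")
print(pronouns_for(conn, "Sam"))  # ('they', 'their', 'them')
```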

What about other datastores?

No matter where you store this, the same principle exists.

You could have a file, priorities.prop, that has the list of priorities. You read this list in from a property file.

You could have a document store database (like CouchDB) that has an entry for enums (and then write a validation function in JavaScript):

{
   "_id": "c18b0756c3c08d8fceb5bcddd60006f4",
   "_rev": "1-c89f76e36b740e9b899a4bffab44e1c2",
   "priorities": [ "critical", "high", "medium", "low" ],
   "severities": [ "blocker", "bad", "annoying", "cosmetic" ]
}

You could have an XML file with a bit of a schema:

<xs:element name="priority" type="priorityType"/>

<xs:simpleType name="priorityType">
  <xs:restriction base="xs:string">
    <xs:enumeration value="critical"/>
    <xs:enumeration value="high"/>
    <xs:enumeration value="medium"/>
    <xs:enumeration value="low"/>
  </xs:restriction>
</xs:simpleType>

The core idea is the same. The data store itself is where the list of valid values needs to be stored and enforced. By placing it here, it is easier to reason about the code and the data. You don't have to worry about defensively checking what you have each time (is it upper case? or lower? Why is there a chritical type in this column? etc...) because you know that what you get back from the datastore is exactly what the datastore expects you to send - and you can query the datastore for a list of valid values.

The takeaway

The set of valid values is data, not code. You do need to strive for DRY code - but the issue of duplication is that you are duplicating the data in the code, rather than respecting its place as data and storing it in a database.

It makes writing multiple applications against the datastore easier and avoids having instances where you will need to deploy everything that is tightly coupled to the data itself - because you haven't coupled your code to the data.

It makes testing applications easier because you don't have to retest the entire application when the CEO priority is added - because you don't have any code that cares about the actual value of the priority.

Being able to reason about the code and the data independently from each other makes it easier to find and fix bugs when doing maintenance.

OTHER TIPS

Which of these do you think is more likely to produce mistakes when reading the query?

select * 
from Person 
where Gender = 1

select * 
from Person join Gender on Person.Gender = Gender.GenderId
where Gender.Label = 'Female'

People make enum tables in SQL because they find the latter to be more readable - leading to fewer errors writing and maintaining SQL.

You could make gender a string directly in Person, but then you would have to try and enforce case. You also may increase the storage hit for the table and the query time due to the difference between strings and integers depending on how awesome your DB is at optimizing things.

I can't believe people didn't mention this yet.

Foreign Keys

By keeping the enum in your database, and adding a foreign key on the table that contains an enum value you ensure that no code ever enters incorrect values for that column. This helps your data integrity and is the most obvious reason IMO you should have tables for enums.
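Here is a small sketch of that enforcement, using Python's built-in sqlite3 module. The table and column names and the try_insert helper are assumptions for the example. Note that SQLite in particular only enforces foreign keys when the foreign_keys pragma is switched on for the connection; other databases typically enforce them by default.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    # SQLite ignores foreign key constraints unless this is enabled.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE gender (
            gender_id INTEGER PRIMARY KEY,
            label     TEXT NOT NULL UNIQUE
        );
        CREATE TABLE person (
            person_id INTEGER PRIMARY KEY,
            name      TEXT NOT NULL,
            gender_id INTEGER NOT NULL REFERENCES gender (gender_id)
        );
        INSERT INTO gender (gender_id, label) VALUES (1, 'Male'), (2, 'Female');
    """)
    return conn


def try_insert(conn: sqlite3.Connection, name: str, gender_id: int) -> bool:
    """Return True if the row was stored, False if the database rejected it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO person (name, gender_id) VALUES (?, ?)",
                (name, gender_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False


conn = make_db()
print(try_insert(conn, "Alice", 2))   # True: 2 exists in gender
print(try_insert(conn, "Bob", 99))    # False: 99 is not a known gender
```

The second insert fails no matter which application or script issued it, which is what makes the constraint more reliable than validation in code.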

I'm in the camp that agrees with you. If you keep a Gender enum in your code and a tblGender in your database, you may run into trouble come maintenance-time. You'll need to document that these two entities should have the same values and thus any changes you make to one you must also make to the other.

You'll then need to pass the enum values to your stored procedures like so:

create procedure InsertPerson @name varchar(50), @gender int --length 50 is an arbitrary choice
as
    insert into tblPeople (name, gender)
    values (@name, @gender)

But think how you'd do this if you kept these values in a database table:

create procedure InsertPerson @name varchar(50), @genderName varchar(50) --length 50 is an arbitrary choice
as
    insert into tblPeople (name, gender)
    select @name, pkGender
    from tblGender
    where genderName = @genderName --I hope these are the same

Sure, relational databases are built with joins in mind, but which query is easier to read?

Here's another example query:

create procedure SpGetGenderCounts
as
    select count(*) as count, gender
    from tblPeople
    group by gender

Compare that to this:

create procedure SpGetGenderCounts
as
    select count(*) as count, genderName
    from tblPeople
    inner join tblGender on pkGender = fkGender
    group by genderName --assuming no two genders have the same name

Here's yet another example query:

create procedure GetAllPeople
as
    select name, gender
    from tblPeople

Note that in this example, you'd have to convert the gender cell in your results from an int to an enum. These conversions are easy, however. Compare that to this:

create procedure GetAllPeople
as
    select name, genderName
    from tblPeople
    inner join tblGender on pkGender = fkGender

All of these queries are smaller and more maintainable when going with your idea of keeping the enum definitions out of the database.

First you need to decide if the database will only ever be used by one application or if there is a potential for multiple applications to use it. In some cases a database is nothing more than a file format for an application (SQLite databases can often be used in this regard). In this case, not duplicating the enum definition as a table can often be fine and may make more sense.

However, as soon as you want to consider the possibility of having multiple applications accessing the database, then a table for the enum makes a lot of sense (the other answers go into why in more detail). The other thing to consider is whether you or another developer will want to look at the raw database data. If so, this can be considered another application use (just one where the language is raw SQL).

If you have the enum defined in code (for cleaner code and compile time checking) as well as a table in the database, I would recommend adding unit tests to verify that the two are in sync.
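A sync test of that kind might look roughly like the sketch below (Python, standard library only). The Gender enum, the gender table, and the helpers open_db and enum_table_diff are invented for the example; in a real project open_db would connect to a test copy of the actual database rather than build one in memory.

```python
import sqlite3
import unittest
from enum import Enum


class Gender(Enum):
    """The code-side enum; values are assumed to match the table's IDs."""
    MALE = 1
    FEMALE = 2


def open_db() -> sqlite3.Connection:
    """Stand-in for connecting to the real database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE gender (gender_id INTEGER PRIMARY KEY, label TEXT NOT NULL);
        INSERT INTO gender VALUES (1, 'MALE'), (2, 'FEMALE');
    """)
    return conn


def enum_table_diff(conn: sqlite3.Connection):
    """Return (missing_in_db, missing_in_code) as sets of (id, name) pairs."""
    in_code = {(member.value, member.name) for member in Gender}
    in_db = set(conn.execute("SELECT gender_id, label FROM gender"))
    return in_code - in_db, in_db - in_code


class GenderSyncTest(unittest.TestCase):
    def test_enum_matches_table(self):
        missing_in_db, missing_in_code = enum_table_diff(open_db())
        self.assertEqual(missing_in_db, set(), "enum members missing from table")
        self.assertEqual(missing_in_code, set(), "table rows missing from enum")
```

Run it with python -m unittest. Checking both directions matters: a row added only to the table is as much a mismatch as a member added only to the enum.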

I would create a Genders table for the reason that it can be used in data analysis. I could look up all the Male or Female Persons in the database to generate a report. The more ways you can view your data, the easier it will be to discover trending information. Obviously, this is a very simple enumeration, but for complex enumerations (like the countries of the world, or states), it makes it easier to generate specialized reports.
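One concrete reporting benefit is that a lookup table lets a report list every valid value, including ones that currently have no rows. The sketch below (Python, built-in sqlite3) shows this with a LEFT JOIN; the schema, the sample rows and the gender_counts helper are made up for illustration.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Hypothetical Genders/Persons tables with a few sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE gender (gender_id INTEGER PRIMARY KEY, label TEXT NOT NULL);
        CREATE TABLE person (
            person_id INTEGER PRIMARY KEY,
            name      TEXT NOT NULL,
            gender_id INTEGER NOT NULL REFERENCES gender (gender_id)
        );
        INSERT INTO gender VALUES (1, 'Male'), (2, 'Female'), (3, 'Other');
        INSERT INTO person (name, gender_id)
            VALUES ('Alice', 2), ('Bob', 1), ('Carol', 2);
    """)
    return conn


def gender_counts(conn: sqlite3.Connection):
    """Count people per gender, keeping genders that nobody has selected."""
    return conn.execute(
        "SELECT g.label, COUNT(p.person_id)"
        " FROM gender g LEFT JOIN person p ON p.gender_id = g.gender_id"
        " GROUP BY g.gender_id, g.label"
        " ORDER BY g.gender_id"
    ).fetchall()


print(gender_counts(make_db()))  # [('Male', 1), ('Female', 2), ('Other', 0)]
```

Grouping on the person table alone would silently drop the 'Other' line, because the report would have no way of knowing that value exists.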

When you have a code enumeration that is used to drive business logic in code you should still create a table to represent the data in the DB for the many reasons detailed above/below. Here are a few tips to ensure that your DB values stay in sync with the code values:

Do not make the ID field on the table an Identity column. Include ID and Description as fields.
Do something different in the table that helps developers know that the values are semi-static/tied to a code enumeration. In all other look-up tables (usually where values can be added by users) I typically have a LastChangedDateTime and LastChangedBy, but not having them on enum related tables helps me remember that they are only changeable by developers. Document this.
Create verification code that checks to see that each value in the enumeration is in the corresponding table, and that only those values are in the corresponding table. If you have automated application "health tests" that run post-build, add it there. If not, make the code run automatically on application startup whenever the application is running in the IDE.
Create production delivery SQL scripts which do the same, but from inside the DB. If created correctly they will help with environment migrations as well.

It also depends on who accesses the data. If you just have one application, keeping the enum only in code might be fine. If you add in a data warehouse or a reporting system, they will need to know what each code means, that is, what the human-readable version of the code is.

Usually, the type table wouldn't be duplicated as an enum in the code. You could load the type table in a list that is cached.

Class GenderList

   Public Shared Property UnfilteredList
   Public Shared Property Male = GetItem("M")
   Public Shared Property Female = GetItem("F")

End Class

Often, types come and go. You would need a date for when a new type was added, you would need to know when a specific type was removed, and you would want to display a type only when needed. What if one client wants "transgender" as a gender but other clients don't? All of this information is best stored in the database.
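As a rough illustration of tracking when types come and go, the sketch below (Python, built-in sqlite3) adds validity dates to a type table and asks which types were active on a given day. The table gender_type, its columns, the helper active_types, and every code and date in the sample rows are invented for the example.

```python
import sqlite3
from datetime import date


def make_db() -> sqlite3.Connection:
    """Hypothetical type table with added/removed dates (sample data only)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE gender_type (
            code       TEXT PRIMARY KEY,
            label      TEXT NOT NULL,
            added_on   TEXT NOT NULL,  -- ISO-8601 date, so text comparison works
            removed_on TEXT            -- NULL while the type is still in use
        );
        INSERT INTO gender_type VALUES
            ('M', 'Male',         '2015-01-01', NULL),
            ('F', 'Female',       '2015-01-01', NULL),
            ('X', 'Retired type', '2015-01-01', '2018-06-30'),
            ('T', 'Transgender',  '2020-03-01', NULL);
    """)
    return conn


def active_types(conn: sqlite3.Connection, on: date) -> list[str]:
    """Return the codes that were valid on the given day."""
    day = on.isoformat()
    rows = conn.execute(
        "SELECT code FROM gender_type"
        " WHERE added_on <= ?"
        "   AND (removed_on IS NULL OR removed_on > ?)"
        " ORDER BY code",
        (day, day),
    )
    return [code for (code,) in rows]


conn = make_db()
print(active_types(conn, date(2016, 1, 1)))  # ['F', 'M', 'X']
print(active_types(conn, date(2021, 1, 1)))  # ['F', 'M', 'T']
```

Old rows that still reference a retired code stay valid, while new data entry screens only offer the codes that are active today. Per-client availability could be handled the same way with an extra mapping table.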

Licensed under: CC-BY-SA with attribution

Not affiliated with softwareengineering.stackexchange