Question

Suppose I have 4 types of services I offer (they are unlikely to change often):

  • Testing
  • Design
  • Programming
  • Other

Suppose I have 60-80 actual services that each fall into one of the above categories. For example, a service can be "Test Program using technique A", and it is of type "Testing".

I want to encode them into a database. I came up with a few options:

Option 0:

Use VARCHAR to encode the service type directly as a string.

Option 1:

Use a database enum. But enum is evil.

Option 2:

Use two tables:

service_line_item (id, service_type_id INT, description VARCHAR);
service_type (id, service_type VARCHAR);

I can even enjoy referential integrity:

ALTER TABLE service_line_item 
    ADD FOREIGN KEY (service_type_id) REFERENCES service_type (id);

Sounds good, yes?

But I still have to encode things and deal with integers, e.g. when populating the table. Or I have to create elaborate programming or DB constructs when populating or dealing with the table: namely, JOINs when dealing with the database directly, or new object-oriented entities on the programming side, which I then have to make sure I operate correctly.

Option 3:

Don't use enum, do not use two tables, but just use an integer column

service_line_item (
    id,
    service_type INT,        -- use 0, 1, 2, 3 (for service types)
    description VARCHAR
);

This is like a 'fake enum' that requires more overhead on the code side of things, e.g. knowing that {2 == 'Programming'} and dealing with it appropriately.

Question:

Currently I have implemented it using Option 2, guided by two principles:

  1. do not use enum (option 1)
  2. avoid using a database as a spreadsheet (option 0)

But I can't help feeling that it is wasteful in terms of programming and cognitive overhead -- I have to be aware of two tables, and deal with two tables, vs one.

For a 'less wasteful way', I am looking at Option 3. It is lighter and requires essentially the same code constructs to operate (with slight modifications, but the complexity and structure are basically the same, only with a single table).

I suppose ideally it is not always wasteful, and there are good cases for either option, but is there a good guideline as to when one should use Option 2 and when Option 3?

When there are only two types (binary)

To add a bit more to this question... in the same vein, I have a binary option of "Standard" or "Exception" Service, which can apply to the service line item. I have encoded that using Option 3.

I chose not to create a new table just to hold values {"Standard", "Exception"}. So my column just holds {0, 1} and my column name is called exception, and my code is doing a translation from {0, 1} => {STANDARD, EXCEPTION} (which I encoded as constants in programming language)

So far I am not liking that way either (I like neither option 2 nor option 3). I do find option 2 superior to option 3, but it has more overhead, and I still cannot escape encoding things as integers whichever of the two I use.

ORM

To add some context, after reading answers - I have just started using an ORM again (recently), in my case Doctrine 2. After defining DB schema via Annotations, I wanted to populate the database. Since my entire data set is relatively small, I wanted to try using programming constructs to see how it works.

I first populated service_types, and then service_line_items, as there was an existing list from an actual spreadsheet. So things like 'standard/exception' and 'Testing' are all strings on the spreadsheet, and they have to be encoded into proper types before storing them in DB.

I found this SO answer: What do you use instead of ENUM in doctrine2?, which suggested to not use DB's enum construct, but to use an INT field and to encode the types using 'const' construct of the programming language.

But as pointed out in the above SO question, I can avoid using integers directly and use language constructs -- constants -- once they are defined....

But still .... no matter how you turn it, if I am starting with string as a type, I have to first convert it to a proper type, even when using an ORM.

So if say $str = 'Testing';, I still need to have a block somewhere that does something like:

switch ($str) {
    case 'Testing':  $type = MyEntity::TESTING; break;
    case 'Other':    $type = MyEntity::OTHER; break;
}

The good thing is you are not dealing with integers/magic numbers [instead, dealing with encoded constant quantities], but the bad thing is you can't auto-magically pull things in and out of the database without this conversion step, to my knowledge.

And that's what I meant, in part, by saying things like "still have to encode things and deal with integers". (Granted, now, after Ocramius' comment, I won't have to deal directly with integers, but deal with named constants and some conversion to/from constants, as needed).


Solution

Option #2, using reference tables, is the standard way of doing it. It has been used by millions of programmers, and is known to work. It is a pattern, so anyone else looking at your stuff will immediately know what is going on. There exist libraries and tools that work on databases, saving you from lots and lots of work, that will handle it correctly. The benefits of using it are innumerable.

Is it wasteful? Yes, but only slightly. Any half-decent database will always keep such frequently joined small tables cached, so the waste is generally imperceptible.

All other options that you described are ad hoc and hacky, including MySQL's enum, because it is not part of the SQL standard. (Other than that, what sucks with enum is MySQL's implementation, not the idea itself. I would not mind seeing it one day as part of the standard.)

Your final option #3, using a plain integer, is especially hacky. You get the worst of all worlds: no referential integrity, no named values, no definitive knowledge within the database of what a value stands for, just arbitrary integers thrown all over the place. By this token, you might as well quit using constants in your code and start using hard-coded values instead, as in circumference = radius * 6.28318530718; How about that?
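To make the referential-integrity point concrete, here is a minimal sketch using Python's built-in sqlite3 module, chosen only so the example is self-contained; it is not the asker's actual stack. The table `service_line_item_plain` and the helper `try_insert` are hypothetical names introduced for illustration. It shows that a nonexistent type id is rejected by the foreign key in Option 2, while the plain integer column of Option 3 accepts it silently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    -- Option 2: reference table plus foreign key
    CREATE TABLE service_type (
        id INTEGER PRIMARY KEY,
        service_type VARCHAR(32) NOT NULL UNIQUE
    );
    CREATE TABLE service_line_item (
        id INTEGER PRIMARY KEY,
        service_type_id INTEGER NOT NULL REFERENCES service_type (id),
        description VARCHAR(255)
    );
    -- Option 3: plain integer column (hypothetical table name)
    CREATE TABLE service_line_item_plain (
        id INTEGER PRIMARY KEY,
        service_type INTEGER,
        description VARCHAR(255)
    );
    INSERT INTO service_type (id, service_type) VALUES
        (0, 'Testing'), (1, 'Design'), (2, 'Programming'), (3, 'Other');
""")


def try_insert(table, column, type_id):
    """Return True if the row was accepted, False if a constraint rejected it."""
    try:
        conn.execute(
            f"INSERT INTO {table} ({column}, description) VALUES (?, ?)",
            (type_id, "Test program using technique A"),
        )
        return True
    except sqlite3.IntegrityError:
        return False


fk_ok = try_insert("service_line_item", "service_type_id", 0)     # valid type
fk_bad = try_insert("service_line_item", "service_type_id", 42)   # no such type
plain_bad = try_insert("service_line_item_plain", "service_type", 42)
print(fk_ok, fk_bad, plain_bad)
```

With the foreign key in place, the meaningless value 42 never reaches the table; with the plain integer column, catching it is left entirely to application code.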

I think you should re-examine why you find reference tables onerous. Nobody else finds them onerous, as far as I know. Could it be that it is because you are not using the right tools for the job?

Your remarks about having to "encode things and deal with integers", or having to "create elaborate programming constructs", or "creating new object oriented entities on the programming side", suggest that you may be doing object-relational mapping (ORM) on the fly, dispersed throughout the code of your application, or at best rolling your own object-relational mapping mechanism, instead of using an existing ORM tool for the job, such as Hibernate. All these things are a breeze with Hibernate. It takes a little while to learn, but once you have learned it, you can really focus on developing your application and forget about the nitty-gritty mechanics of how to represent things in the database.

Finally, if you want to make your life easier when working directly with the database, there are at least two things that you can do, that I can think of right now:

  1. Create views that join your main tables with whatever reference tables they reference, so that each row contains not only the reference ids, but also the corresponding names.

  2. Instead of using an integer id for the reference table, use a CHAR(4) column, with 4-letter abbreviations. So, the ids of your categories would become "TEST", "DSGN", "PROG", "OTHR". (Their descriptions would remain proper English words, of course.) It will be a bit slower, but trust me, nobody will notice.
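Both suggestions can be sketched together. The following is a minimal, hedged example using Python's built-in sqlite3 module purely to keep it self-contained; the view name `service_line_item_named` and the column lengths are assumptions made for illustration, while the four-letter codes are the ones proposed above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Reference table keyed by a readable 4-letter code instead of an integer
    CREATE TABLE service_type (
        id CHAR(4) PRIMARY KEY,
        name VARCHAR(32) NOT NULL
    );
    CREATE TABLE service_line_item (
        id INTEGER PRIMARY KEY,
        service_type_id CHAR(4) NOT NULL REFERENCES service_type (id),
        description VARCHAR(255)
    );
    INSERT INTO service_type (id, name) VALUES
        ('TEST', 'Testing'), ('DSGN', 'Design'),
        ('PROG', 'Programming'), ('OTHR', 'Other');
    INSERT INTO service_line_item (service_type_id, description) VALUES
        ('TEST', 'Test program using technique A');

    -- Hypothetical view: each row carries the code and the full name
    CREATE VIEW service_line_item_named AS
        SELECT li.id,
               li.service_type_id,
               st.name AS service_type_name,
               li.description
        FROM service_line_item AS li
        JOIN service_type AS st ON st.id = li.service_type_id;
""")

rows = conn.execute(
    "SELECT service_type_id, service_type_name, description "
    "FROM service_line_item_named"
).fetchall()
print(rows)
```

Ad hoc queries can then go against the view and never spell out the JOIN, and even the base table stays readable because the key itself is an abbreviation.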

As for the case where there are only two types, most people just use a boolean column. So, that "standard/exception" column would be implemented as a boolean and it would be called "IsException".

OTHER TIPS

Use Option 2, with constants or enums on the programming end.
Although it duplicates knowledge, violating the Single Source Of Truth principle, you can deal with it by using the Fail-fast technique. When your system loads it would check that the enums or const values exist in the database. If not, the system should throw an error and refuse to load. It will generally be cheaper to fix this bug at this time than later on when something more serious may have happened.
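A fail-fast startup check of this kind might look like the sketch below. It is written in Python with the built-in sqlite3 module only to stay self-contained; the `ServiceType` enum and the `check_service_types` function are hypothetical names, and the table layout follows Option 2 from the question.

```python
import sqlite3
from enum import Enum


class ServiceType(Enum):
    """Code-side constants that duplicate the rows of the reference table."""
    TESTING = "Testing"
    DESIGN = "Design"
    PROGRAMMING = "Programming"
    OTHER = "Other"


def check_service_types(conn):
    """Raise if any code-side constant has no matching row in service_type."""
    in_db = {row[0] for row in conn.execute("SELECT service_type FROM service_type")}
    missing = {member.value for member in ServiceType} - in_db
    if missing:
        raise RuntimeError(
            f"service types missing from the database: {sorted(missing)}"
        )


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE service_type (
        id INTEGER PRIMARY KEY,
        service_type VARCHAR(32) NOT NULL UNIQUE
    );
    INSERT INTO service_type (service_type) VALUES
        ('Testing'), ('Design'), ('Programming'), ('Other');
""")

check_service_types(conn)  # all four constants are present: no error

# Simulate drift between code and database, then run the check again.
conn.execute("DELETE FROM service_type WHERE service_type = 'Other'")
try:
    check_service_types(conn)
    failed_fast = False
except RuntimeError as exc:
    failed_fast = True
    print(exc)
```

Calling such a check once during application start-up turns a silent mismatch into an immediate, descriptive error.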

There's nothing to stop you using [short] strings as keys, so you could still have the readability of names in your tables and not resort to meaningless surrogate number encoding. You should still have the separate table to describe Service Types, just on the off-chance that, say, your application goes international!

Your Users can see your four categories in their own language, but your database tables still contain values that you can read - and none of it requires any database structure or code changes!

create table service_type 
( id VARCHAR 
, name VARCHAR 
, primary key ( id ) 
);
create table service_line_item 
( id INT 
, service_type VARCHAR 
, description VARCHAR
, primary key ( id )
, foreign key ( service_type ) references service_type ( id )
);

select * from service_type ; 

+-------------+----------------+
| id          | name           |
+-------------+----------------+
| Testing     | Testen         |
| Design      | Design         | 
| Programming | Programmierung |
| Other       | Andere         |
+-------------+----------------+

or, for your French customers ...

update service_type set name = 'Essai'         where id = 'Testing'; 
update service_type set name = 'Conception'    where id = 'Design'; 
update service_type set name = 'Programmation' where id = 'Programming'; 
update service_type set name = 'Autre'         where id = 'Other'; 

Option #2 is the ideal choice. Its overhead is not large enough to justify considering the other options. With this option, the database will remain organized and easy to understand.

Option #3 is faster than option #2, but it requires you to keep track of which integer means what. If for any reason you want to change a number, that might require changes in many places in your code. As a programmer, I want to ensure that there are no loopholes in the architecture and that there is a single place from which I can control a specific task.

Licensed under: CC-BY-SA with attribution