SQL Server normalization tactic: varchar vs int Identity

https://stackoverflow.com/questions/144109

02-07-2019
|

Question

I'm just wondering what the optimal solution is here.

Say I have a normalized database. The primary key of the whole system is a varchar. What I'm wondering is should I relate this varchar to an int for normalization or leave it? It's simpler to leave as a varchar, but it might be more optimal

For instance I can have

People
======================
name      varchar(10)   
DoB       DateTime    
Height    int  

Phone_Number
======================
name      varchar(10)   
number    varchar(15)

Or I could have

People
======================
id        int Identity   
name      varchar(10)   
DoB       DateTime  
Height    int  

Phone_Number
======================
id        int   
number    varchar(15)

Add several other one-to-many relationships of course.

What do you all think? Which is better and why?

Solution

Can you really use names as primary keys? Isn't there a high risk of several people with the same name?

If you really are so lucky that your name attribute can be used as primary key, then - by all means - use that. Often, though, you will have to make something up, like a customer_id, etc.

And finally: "NAME" is a reserved word in at least one DBMS, so consider using something else, e.g. fullname.

OTHER TIPS

I believe that the majority of people who have developed any significant sized real world database applications will tell you that surrogate keys are the only realistic solution.
I know the academic community will disagree but that is the difference between theoretical purity and practicality.

Any reasonable sized query that has to do joins between tables that use non-surrogate keys where some tables have composite primary keys quickly becomes unmaintainable.

Using any kind of non-synthetic data (i.e. anything from the user, as opposed to generated by the application) as a PK is problematic; you have to worry about culture/localization differences, case sensitivity (and other issues depending on DB collation), can result in data problems if/when that user-entered data ever changes, etc.

Using non-user-generated data (Sequential GUIDs (or non-sequential if your DB doesn't support them or you don't care about page splits) or identity ints (if you don't need GUIDs)) is much easier and much safer.

Regarding duplicate data: I don't see how using non-synthetic keys protects you from that. You still have issues where the user enters "Bob Smith" instead of "Bob K. Smith" or "Smith, Bob" or "bob smith" etc. Duplication management is necessary (and pretty much identical) regardless of whether your key is synthetic or non-synthetic, and non-synthetic keys have a host of other potential issues that synthetic keys neatly avoid.

Many projects don't need to worry about that (tightly constrained collation choices avoid many of them, for example) but in general I prefer synthetic keys. This is not to say you can't be successful with organic keys, clearly you can, but for many projects they're not the better choice.

I think if your VARCHAR was larger you would notice you're duplicating quite a bit of data throughout the database. Whereas if you went with a numeric ID column, you're not duplicating nearly the same amount of data when adding foreign key columns to other tables.

Moreover, textual data is a royal pain in terms of comparisons, your life is much easier when you're doing WHERE id = user_id versus WHERE name LIKE inputname (or something similar).

If the "name" field really is appropriate as a primary key, then do it. The database will not get more normalized by creating a surrogate key in that case. You will get some duplicate strings for foreign keys, but that is not a normalization issue, since the FK constraint guarantrees integrity on strings just as it would on surrogate keys.

However you are not explaining what the "name" is. In practice it is very seldom that a string is appropriate as a primary key. If it is the name of a person, it wont work as a PK, since more than one person can have the same name, people can change names and so on.

One thing that others don't seem to have mentioned is that joins on int fields tend to perform better than joins on varchar fields.

And I definitely would always use a surrogate key over using names (of people or businesses) because they are never unique over time. In our database, for instance, we have 164 names with over 100 instances of the same name. This clearly shows the dangers of considering using name as a key field.

The original question is not one of normalization. If you have a normalized database, as you stated, then you do not need to change it for normalization reasons.

There are really two issues in your question. The first is whether ints or varchars a preferable for use as primary keys and foreign keys. The second is whether you can use the natural keys given in the problem definition, or whether you should generate a synthetic key (surrogate key) to take the place of the natural key.

ints are a little more concise than varchars, and a little more efficient for such things as index processing. But the difference is not overwhelming. You should probably not make your decision on this basis alone.

The question of whether the natural key provided really works as a natural key or not is much more significant. The problem of duplicates in a "name" column is not the only problem. There is also the problem of what happens when a person changes her name. This problem probably doesn't surface in the example you've given, but it does surface in lots of other database applications. An example would be the transcript over four years of all the courses taken by a student. A woman might get married and change her name in the course of four years, and now you're stuck.

You either have to leave the name unchanged, in which case it no longer agrees with the real world, or update it retroactively in all the courses the person took, which makes the database disagree with the printed rosters made at the time.

If you do decide on a synthetic key, you now have to decide whether or not the application is going to reveal the value of the synthetic key to the user community. That's another whole can of worms, and beyond the scope of this discussion.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow