Question

This question might be better suited to programmers.stackexchange. If so, please migrate.

I am currently pondering the complexity of typical data models. Everybody knows that data models should be normalized; on the other hand, a normalized data model requires quite a few joins to reassemble the data later, and joins are potentially expensive operations, depending on the size of the tables involved. So the question I am trying to figure out is how one usually goes about this tradeoff: in practice, how many joins would you find acceptable in typical queries when designing a data model? This is especially interesting when a single query contains multiple joins.

As an example, let's say we have users, who own houses, in which there are rooms, which have drawers, which contain items. Trivially normalizing this with tables for users, houses, rooms, drawers, and items in the sense explained above would later require me to join five tables to get all the items belonging to a certain user. This seems like an awful lot of complexity to me.
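
For concreteness, here is a minimal sketch of such a schema and of the query that gathers one user's items; all table names, column names, and the user id 42 are made up for illustration, not taken from any particular system.

```sql
-- Hypothetical normalized schema: each level references its parent.
CREATE TABLE users   (user_id   INT PRIMARY KEY, name TEXT);
CREATE TABLE houses  (house_id  INT PRIMARY KEY, user_id   INT REFERENCES users(user_id));
CREATE TABLE rooms   (room_id   INT PRIMARY KEY, house_id  INT REFERENCES houses(house_id));
CREATE TABLE drawers (drawer_id INT PRIMARY KEY, room_id   INT REFERENCES rooms(room_id));
CREATE TABLE items   (item_id   INT PRIMARY KEY, drawer_id INT REFERENCES drawers(drawer_id), label TEXT);

-- All items belonging to one user: five tables, four join operations.
SELECT i.item_id, i.label
FROM users u
JOIN houses  h ON h.user_id   = u.user_id
JOIN rooms   r ON r.house_id  = h.house_id
JOIN drawers d ON d.room_id   = r.room_id
JOIN items   i ON i.drawer_id = d.drawer_id
WHERE u.user_id = 42;
```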

Most likely the size of the tables matters, too. Joining five tables with little data is not as bad as joining three tables with millions of rows. Or is this consideration wrong?

Solution

There are good reasons for database normalization, and I've seen queries joining more than 20 tables and sub-queries that have worked just fine for a long time. I find normalization a huge win, as it lets me add new features to existing, working applications without affecting the parts that already work.

Databases come with different features to make your life easier:

  • you can create views for the most commonly used queries (although that is not their only use case); see the sketch after this list;
  • some RDBMSs provide Common Table Expressions (CTEs), which let you use named sub-queries and also write recursive queries, also sketched below;
  • some RDBMSs provide procedural extension languages (such as PL/SQL or PL/pgSQL), which let you write your own functions that hide the complexity of your schema, so your application only makes API-like calls to operate on its data.
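
As a rough sketch of the first two points, reusing the hypothetical users/houses/rooms/drawers/items schema from the question (the view name user_items and the CTE name drawer_items are invented for illustration):

```sql
-- A view that hides the five-table join behind a single name.
CREATE VIEW user_items AS
SELECT u.user_id, i.item_id, i.label
FROM users u
JOIN houses  h ON h.user_id   = u.user_id
JOIN rooms   r ON r.house_id  = h.house_id
JOIN drawers d ON d.room_id   = r.room_id
JOIN items   i ON i.drawer_id = d.drawer_id;

-- Application code now queries the view, not the underlying tables.
SELECT item_id, label FROM user_items WHERE user_id = 42;

-- A one-off alternative: name part of the work as a CTE (a named sub-query).
WITH drawer_items AS (
    SELECT d.room_id, i.item_id, i.label
    FROM drawers d
    JOIN items i ON i.drawer_id = d.drawer_id
)
SELECT di.item_id, di.label
FROM houses h
JOIN rooms  r        ON r.house_id = h.house_id
JOIN drawer_items di ON di.room_id = r.room_id
WHERE h.user_id = 42;
```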

A while back there was a somewhat related question, "How does a SQL statement containing mutiple joins work?" It might be worthwhile to look into it as well.

Developing an application against a normalized database is easier, because with the proper approach you can isolate your schema behind views and functions and keep your application code immune to schema changes. If you go for a denormalized design, design changes may affect a great deal of your code, as denormalized systems tend to be heavily optimized for performance at the cost of flexibility.
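
A minimal PL/pgSQL sketch of that isolation idea, again assuming the hypothetical schema from the question; the function name items_of_user and its signature are made up, not a standard API:

```sql
-- A stored function as the application's "API call"; the schema behind it can
-- change without touching application code, as long as the signature stays stable.
CREATE FUNCTION items_of_user(p_user_id INT)
RETURNS TABLE (item_id INT, label TEXT)
LANGUAGE plpgsql
AS $$
BEGIN
    RETURN QUERY
    SELECT i.item_id, i.label
    FROM houses  h
    JOIN rooms   r ON r.house_id  = h.house_id
    JOIN drawers d ON d.room_id   = r.room_id
    JOIN items   i ON i.drawer_id = d.drawer_id
    WHERE h.user_id = p_user_id;
END;
$$;

-- Application side: one call, no knowledge of the joins behind it.
SELECT * FROM items_of_user(42);
```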

Other tips

Normalizing databases is an art form in itself.
If you structure your joins correctly you will only be grabbing the columns needed.
It should be much faster to run a query against millions of records spread over multiple tables, joining only the fields you need, than against one or two tables that hold all the records. In the second case you retrieve all of the data, and sorting through it in code would be a nightmare.
MySQL is very good at retrieving only the data requested.
Just because the query is long doesn't mean it is slower.
I have seen query statements well over 20 lines of code that were very fast.
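
A small sketch of the contrast, assuming the hypothetical user_items view from the accepted answer and an equally hypothetical flat table user_items_flat:

```sql
-- Wide, flat table: every row drags along every column, needed or not.
SELECT * FROM user_items_flat WHERE user_id = 42;

-- Normalized tables behind the view: only the columns you ask for are returned,
-- and each underlying table stays narrow.
SELECT label FROM user_items WHERE user_id = 42;
```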

Have faith in the queries you write, and if you don't, write a test script and try them yourself.

A totally normalized data model has a higher performance cost but is more resilient to change. A data model flat as a dime, tuned for one query, will perform much better, but you will pay the price when the specs change.

So maybe the question is: will the use of your data model (the queries) change a lot? If not, don't normalize; just tune it for the specific queries (ask your DBA). Otherwise, normalize and judge by the query execution plan whether you use too many joins; I can't give you a specific number.
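
One way to judge that is to ask the optimizer for its plan. This sketch assumes the hypothetical user_items view from the accepted answer; EXPLAIN is available in both MySQL and PostgreSQL, and EXPLAIN ANALYZE (which also runs the query and reports actual timings) in PostgreSQL and recent MySQL versions:

```sql
-- Show how the optimizer intends to execute the query behind the view.
EXPLAIN SELECT item_id, label FROM user_items WHERE user_id = 42;

-- Also run it and report actual row counts and timings.
EXPLAIN ANALYZE SELECT item_id, label FROM user_items WHERE user_id = 42;
```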

The answer to your question is in:

http://en.wikipedia.org/wiki/Database_normalization

If performance becomes a problem, those issues can be solved with denormalization. You should not plan that step up front (unless you already know the expected load). Denormalize only when it is really needed, and base the decision on measurements.
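
One common measurement-driven denormalization, sketched for the hypothetical schema above: a PostgreSQL materialized view (other systems would use a summary table maintained by triggers or a batch job); the name user_items_cache is made up:

```sql
-- Precompute the expensive join once (user_items is the view sketched earlier)
-- and refresh it on a schedule, instead of re-joining on every read.
CREATE MATERIALIZED VIEW user_items_cache AS
SELECT user_id, item_id, label FROM user_items;

-- Reads become a single-table lookup; the data is only as fresh as the last refresh.
SELECT item_id, label FROM user_items_cache WHERE user_id = 42;
REFRESH MATERIALIZED VIEW user_items_cache;
```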

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow