Question

In my Rails application, I have a variety of database tables that contain user data. Some of these tables have a lot of rows (as many as 500,000 rows per user in some cases) and are queried frequently. Whenever I query any table for anything, the user_id of the current user is somewhere in the query - either directly, if the table has a direct relation with the user, or through a join, if they are related through some other tables.

Should I denormalize the user_id and include it in every table, for faster performance?


Here's one example:

  • Address belongs to user, and has a user_id
  • Envelope belongs to user, and has a user_id
  • AddressesEnvelopes joins an Address and an Envelope, so it has envelope_id and address_id -- it doesn't have user_id, but could get to it through either the envelope or the address (which must belong to the same user).

One common expensive query is to select all the AddressesEnvelopes for a particular user, which I could accomplish by joining with either Address or Envelope, even though I don't need anything from those tables. Or I could just duplicate the user id in this table.


Here's a different scenario:

  • Letter belongs to user, and has a user_id
  • Recepient belongs to Letter, and has a letter_id
  • RecepientOption belongs to Recepient, and has a recepient_id

Would it make sense to duplicate the user_id in both Recepient and RecepientOption, even though I could always get to it by going up through the associations, through Letter?


Some notes:

  • There are never any objects that are shared between users. An entire hierarchy of related objects always belongs to the same user.
  • The user owner of objects never changes.
  • Database performance is important because it's a data intensive application. There are many queries and many tables.

So should I include user_id in every table so I can use it when creating indexes? Or would that be bad design?

Was it helpful?

Solution

I'd like to point out that it isn't necessary to denormalize, if you are willing to work with composite primary keys. Sample for AddressEnvelop case:

user(
    #user_id
)
address(
    #user_id
,   #addres_num
)
envelope(
    #user_id
,   #envelope_num
)
address_envelope(
    #user_id
,   #addres_num
,   #envelope_num
)

(the # indicates a primary key column)

I am not a fan of this design if I can avoid it, but considering the fact that you say that all these objects are tied to a user, this type of design would make it relatively simply to partition your data (either logically, put ranges of users in separate tables or physically, using multiple databases or even machines)

Another thing that would make sense with this type of design is using clustered indexes (in MySQL, the primary key of InnoDB tables are built from a clustered index). If you ensure the user_id is always the first column in your index, it will ensure that for each table, all data for one user is stored close together on disk. This is great when you always query by user_id, but it can hurt perfomance if you query by another object (in which case duplication like you sugessted may be a better solution)

At any rate, before you change the design, first make sure your schema is already optimized, and you have proper indexes on your foreign key columns. If performance really is paramount, you should simply try several solutions and do benchmarks.

OTHER TIPS

As long as you

a) get a measurable performance improvement

and

b) know which parts of your database are real normalized data and which are redundant improvements

there is no reason not to do it!

Do you actually have a measured performance problem? 500 000 rows isn't very large table. Your selects should be reasonable fast if they are not very complex and you have proper indexes on your columns.

I would first see if there are slow queries and try to optimize them with indexes. If that is not enough, only then I would look into denormalization.

Denormalizations that you suggest seem reasonable if you can't achieve the required performance with other means. Just make sure that you keep denormalized fields up-to-date.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top