Efficient way to model aggregate data of a many-to-one relationship (e.g. vote count on a stackoverflow question)

StackOverflow https://stackoverflow.com/questions/923352

Question

I'm curious about what would be the best way to model this for optimized performance. I'm not as concerned about real-time data integrity.

I'll continue with the stackoverflow example

Question
  id
  title
Votes
  id
  user
  question

A question has many votes

For many queries however, we're only concerned with the aggregate number of votes (e.g. to show next to the question).

Good relational db theory would create the two entities (Q and V) as separate relations, requiring a join then a sum or count aggregate call.

Another possibility is to break normal form and occasionally materialize the aggregate value of votes as an attribute in Question (e.g. Question.votes). Performance is gained on reads. However, depending on how stale you are willing to let your "votes" data get, it requires a lot more writes to that Question record, which in turn hinders performance.

Other techniques involving caching, etc. can be used. But I'm just wondering: performance-wise, what's the best solution? Let's say the site has high traffic and receives considerably more votes than questions.

Open to non-relational models as well.


Solution

It's unlikely that a join will be too slow in this case, especially if you have an index on (question) in the Votes table.
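As a rough sketch of the indexed join, the snippet below uses Python's built-in sqlite3 module as a stand-in for the real database (the question appears to be about SQL Server, so syntax and query plans will differ). The index name ix_votes_question, the helper name, and the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE questions (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE votes (
        id INTEGER PRIMARY KEY,
        user TEXT,
        question INTEGER REFERENCES questions(id)
    );
    -- The index suggested above (the name is an assumption).
    CREATE INDEX ix_votes_question ON votes (question);
""")
conn.executemany("INSERT INTO questions (id, title) VALUES (?, ?)",
                 [(1, "First"), (2, "Second")])
conn.executemany("INSERT INTO votes (user, question) VALUES (?, ?)",
                 [("alice", 1), ("bob", 1), ("carol", 1)])
conn.commit()

def questions_with_vote_counts(conn):
    """Return (id, title, vote_count) rows, including questions without votes."""
    return conn.execute("""
        SELECT q.id, q.title, COUNT(v.question) AS vote_count
        FROM questions q
        LEFT JOIN votes v ON v.question = q.id
        GROUP BY q.id, q.title
        ORDER BY q.id
    """).fetchall()

rows = questions_with_vote_counts(conn)
```

Whether this is fast enough depends on your data volume and database engine, so it is worth measuring before denormalizing.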

If it is REALLY too slow, you can cache the vote count in the Question table:

 id - title - votecount

You can update the votecount whenever you record a vote, for example from a stored procedure or directly from your application code.
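A minimal sketch of the application-code variant, again using Python's sqlite3 as a stand-in database. The function name record_vote and the sample data are hypothetical. The vote insert and the counter increment share one transaction, and the increment is done in SQL so that concurrent writers do not overwrite each other's values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE questions (
        id INTEGER PRIMARY KEY,
        title TEXT,
        votecount INTEGER NOT NULL DEFAULT 0
    );
    CREATE TABLE votes (id INTEGER PRIMARY KEY, user TEXT, question INTEGER);
""")
conn.execute("INSERT INTO questions (id, title) VALUES (1, 'First')")
conn.commit()

def record_vote(conn, user, question_id):
    """Insert the vote and bump the cached count in one transaction."""
    with conn:  # commits on success, rolls back if either statement raises
        conn.execute("INSERT INTO votes (user, question) VALUES (?, ?)",
                     (user, question_id))
        # Increment inside the UPDATE instead of read-modify-write in Python.
        conn.execute("UPDATE questions SET votecount = votecount + 1 WHERE id = ?",
                     (question_id,))

record_vote(conn, "alice", 1)
record_vote(conn, "bob", 1)
cached = conn.execute("SELECT votecount FROM questions WHERE id = 1").fetchone()[0]
stored = conn.execute("SELECT COUNT(*) FROM votes WHERE question = 1").fetchone()[0]
```

If you are willing to tolerate staleness, the increment could also be dropped from the write path and left entirely to a periodic rebuild.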

Those updates are tricky, but since you're not that worried about consistency, I guess it's ok if the vote count is sometimes not exactly right. To fix any errors, you can periodically regenerate all cached counts like:

 UPDATE q
 SET votecount = c.cnt
 FROM questions q
 JOIN (SELECT q2.id, count(v.question) AS cnt
       FROM questions q2
       LEFT JOIN votes v on v.question = q2.id
       GROUP BY q2.id) c on c.id = q.id

The aggregate count(v.question) returns 0 if no vote was found for a question, as opposed to count(*), which would return 1 because the left join still produces one row for that question.
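The difference between the two aggregates, and a periodic rebuild, can be tried out with the sketch below. It uses Python's sqlite3 as a stand-in database. The rebuild is written as a correlated subquery rather than an UPDATE with a join, because that form is portable across engines. The function name and the deliberately wrong starting counts are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE questions (
        id INTEGER PRIMARY KEY, title TEXT, votecount INTEGER NOT NULL DEFAULT 0
    );
    CREATE TABLE votes (id INTEGER PRIMARY KEY, user TEXT, question INTEGER);
    -- The cached counts start out wrong on purpose.
    INSERT INTO questions (id, title, votecount)
        VALUES (1, 'Two votes', 7), (2, 'No votes', 5);
    INSERT INTO votes (user, question) VALUES ('alice', 1), ('bob', 1);
""")

# For the question without votes, the LEFT JOIN yields one row whose
# vote columns are NULL: COUNT(column) skips it, COUNT(*) counts it.
count_column, count_star = conn.execute("""
    SELECT COUNT(v.question), COUNT(*)
    FROM questions q
    LEFT JOIN votes v ON v.question = q.id
    WHERE q.id = 2
""").fetchone()

def rebuild_vote_counts(conn):
    """Recompute every cached count from the votes table."""
    with conn:
        # No join here, so COUNT(*) over zero matching votes is 0.
        conn.execute("""
            UPDATE questions
            SET votecount = (SELECT COUNT(*) FROM votes v
                             WHERE v.question = questions.id)
        """)

rebuild_vote_counts(conn)
counts = conn.execute("SELECT id, votecount FROM questions ORDER BY id").fetchall()
```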

If locks are an issue, consider using "with (nolock)" or "set transaction isolation level read uncommitted" to bypass locks (again, based on data integrity being a low priority).

As an alternative to nolock, consider "read committed snapshot", which is meant for databases with heavy read and less write activity. You can turn it on with:

ALTER DATABASE YourDb SET READ_COMMITTED_SNAPSHOT ON;

It is available in SQL Server 2005 and higher. This is how Oracle works by default, and it's what stackoverflow itself uses. There's even a Coding Horror blog entry about it.

OTHER TIPS

I used indexed views from SQL Server 2005 all over the place for this kind of thing on a social networking site. Our load definitely had a high ratio of reads to writes, so it worked well for us.

I would suggest keeping the vote count in memory for the lifetime of the application. Why hit the database on every request for something as simple as a count, when you have already loaded the item once and know its initial count? It also has a lot to do with how you implement your repositories: if your question object lazy-loads its votes but eager-loads the vote count, you can speed things up without memory becoming a concern. Still keep the votes in the database; just maintain the count in your application.
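One way this might look, as a sketch only: the class below (the name VoteCountCache and its methods are hypothetical) loads each question's count from the database the first time it is asked for, then keeps it in a dictionary and increments it as votes are recorded. Python's sqlite3 stands in for the real database. This assumes a single application process; with several servers, each would hold its own copy and the counts would drift apart.

```python
import sqlite3
import threading

class VoteCountCache:
    """Hypothetical in-process cache of vote counts, loaded lazily per question."""

    def __init__(self, conn):
        self._conn = conn
        self._counts = {}
        self._lock = threading.Lock()

    def _load(self, question_id):
        # Hits the database only the first time a question is seen.
        if question_id not in self._counts:
            row = self._conn.execute(
                "SELECT COUNT(*) FROM votes WHERE question = ?", (question_id,)
            ).fetchone()
            self._counts[question_id] = row[0]
        return self._counts[question_id]

    def count(self, question_id):
        with self._lock:
            return self._load(question_id)

    def record_vote(self, user, question_id):
        with self._lock:
            self._load(question_id)  # make sure the baseline is cached first
            with self._conn:  # the vote itself is still persisted
                self._conn.execute(
                    "INSERT INTO votes (user, question) VALUES (?, ?)",
                    (user, question_id),
                )
            self._counts[question_id] += 1

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE votes (id INTEGER PRIMARY KEY, user TEXT, question INTEGER)")
conn.execute("INSERT INTO votes (user, question) VALUES ('alice', 1)")
conn.commit()

cache = VoteCountCache(conn)
before = cache.count(1)
cache.record_vote("bob", 1)
after = cache.count(1)
```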

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow