I've built large-scale forum systems before, and the key to making them performant is to de-normalize anything and everything you can.
You cannot realistically use JOIN on really popular pages. You must keep the number of queries you issue to the absolute minimum. You should never use sub-selects. Always be sure your indexes cover your exact use cases and no more. A query that takes longer than 1-5ms to run is probably way too slow to work on a site that's running at scale. When severe load suddenly makes everything take ten times longer, a 15ms query will take a crippling 150ms or more, while your optimized 1ms queries will take an acceptable 10ms. You're aiming for them to be 0.00s all the time, and it's possible to do this.
Remember that any time you're executing a query and waiting for a response, you're not able to do anything else. If you get a little careless, you'll have requests coming in faster than you can process them and the whole system will buckle.
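To put rough numbers on that, here's a quick back-of-the-envelope sketch in Python. It assumes a single worker that blocks on one query at a time, which is a simplification, and the function name is made up for illustration. The 1ms, 15ms and ten-times figures are the ones from above.

```python
# Rough ceiling on throughput for one worker that waits on each query in turn.
# The single-worker model is an illustrative assumption, not a benchmark.

def max_queries_per_second(query_ms: float) -> float:
    """Upper bound on queries/second if each query blocks for query_ms."""
    return 1000.0 / query_ms

SLOWDOWN = 10  # "ten times longer" under severe load

for base_ms in (1, 15):
    loaded_ms = base_ms * SLOWDOWN
    print(
        f"{base_ms}ms query: at most {max_queries_per_second(base_ms):.1f}/s normally, "
        f"{max_queries_per_second(loaded_ms):.1f}/s under load ({loaded_ms}ms each)"
    )
```

Once requests arrive faster than that ceiling, they start queueing, and that's the buckling.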
Keep your schema simple, even stupidly simple, and by that I mean think about the layout of your page, the information you're showing, and make the schema match that as exactly as possible. Strip it down to the bare essentials. Represent it in a format that's as close as possible to the final output without making needless compromises.
If you're showing username, avatar, post title, number of posts, and date of posting, then those are the fields you have in your table. Yes, you will still have a separate users table, but transpose anything and everything you can into a straightforward structure that makes it as simple as this:
SELECT id, username, user_avatar, post_title, post_count, post_time FROM posts
WHERE forum_id=?
ORDER BY id DESC
Normally you'd have to join against users to get their name, maybe another table to get their particular avatar, and the discussions table to get the post count. You can avoid all that by changing your storage strategy.
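As a minimal sketch of what that storage strategy might look like, here's a runnable version using Python's built-in sqlite3 module. SQLite is only a stand-in so the example runs anywhere. The column types, the create_post and list_posts helpers, and the sample data are my assumptions for illustration. Only the column names in the SELECT come from the query above.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT, avatar TEXT);
-- De-normalized: everything the listing page shows lives on the post row.
CREATE TABLE posts (
    id INTEGER PRIMARY KEY,
    forum_id INTEGER NOT NULL,
    user_id INTEGER NOT NULL,
    username TEXT NOT NULL,      -- copied from users at write time
    user_avatar TEXT,            -- copied from users at write time
    post_title TEXT NOT NULL,
    post_count INTEGER NOT NULL DEFAULT 0,
    post_time TEXT NOT NULL
);
""")

def create_post(forum_id, user_id, title, post_time):
    """Pay the user lookup once, on write, instead of on every page view."""
    username, avatar = db.execute(
        "SELECT username, avatar FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    with db:  # commits on success, rolls back on error
        cur = db.execute(
            "INSERT INTO posts (forum_id, user_id, username, user_avatar,"
            " post_title, post_time) VALUES (?, ?, ?, ?, ?, ?)",
            (forum_id, user_id, username, avatar, title, post_time),
        )
    return cur.lastrowid

def list_posts(forum_id):
    """The hot-path read: one table, no JOIN."""
    return db.execute(
        "SELECT id, username, user_avatar, post_title, post_count, post_time"
        " FROM posts WHERE forum_id = ? ORDER BY id DESC",
        (forum_id,),
    ).fetchall()

db.execute("INSERT INTO users VALUES (1, 'alice', 'alice.png')")
create_post(7, 1, "First", "2024-01-01 10:00:00")
create_post(7, 1, "Second", "2024-01-01 11:00:00")
print(list_posts(7))
```

The cost moves to the write side: if a user changes their name or avatar, the copies on their posts have to be updated too. That's the trade you're making for a JOIN-free read.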
In the case I was working with, it was a requirement to be able to post things in the future as well as in the past, so I had to create a specific "sort key" independent of ID, like your position. If this is not the case for you, just use the id primary key for ordering, something like this:
INDEX post_order (forum_id, id)
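If you want to check that an index like this actually serves the listing query, you can ask the database for its plan. Here's a sketch using Python's sqlite3 module as a stand-in. The standalone CREATE INDEX form and the trimmed-down table are assumptions for the demo. On MySQL you'd run EXPLAIN on the real table instead.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE posts (
    id INTEGER PRIMARY KEY,
    forum_id INTEGER NOT NULL,
    post_title TEXT NOT NULL
);
-- Same columns, same order as the index definition above.
CREATE INDEX post_order ON posts (forum_id, id);
""")

plan = db.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT id, post_title FROM posts WHERE forum_id = ? ORDER BY id DESC",
    (1,),
).fetchall()

# The last column of each plan row is the human-readable detail.
plan_text = " | ".join(row[-1] for row in plan)
print(plan_text)
```

The plan text should name post_order. If it mentions a full scan of posts or a temporary sort step, the index doesn't match the query.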
Using SUM or COUNT is completely out of the question. You need counter-cache columns. These are columns that store counts, such as how many messages are in a particular forum. Yes, they will drift out of sync once in a while like any de-normalized data, so you will need to add tools to keep them in check and to rebuild them completely if required. Usually you can do this as a cron job that runs once daily to repair any minor corruption that might've occurred. Most of the time, if you get your implementation correct, they will be perfectly in sync.
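Here's a minimal sketch of a counter-cache column and its repair job, again using Python's sqlite3 module as a stand-in. The forums.post_count column and the add_post, forum_post_count and rebuild_counters helpers are names I've made up for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE forums (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    post_count INTEGER NOT NULL DEFAULT 0   -- the counter cache
);
CREATE TABLE posts (
    id INTEGER PRIMARY KEY,
    forum_id INTEGER NOT NULL,
    post_title TEXT NOT NULL
);
INSERT INTO forums (id, name) VALUES (1, 'General');
""")

def add_post(forum_id, title):
    # Insert and counter bump share one transaction, so they commit together.
    with db:
        db.execute(
            "INSERT INTO posts (forum_id, post_title) VALUES (?, ?)",
            (forum_id, title),
        )
        db.execute(
            "UPDATE forums SET post_count = post_count + 1 WHERE id = ?",
            (forum_id,),
        )

def forum_post_count(forum_id):
    # Hot path: read the stored number, no COUNT(*).
    return db.execute(
        "SELECT post_count FROM forums WHERE id = ?", (forum_id,)
    ).fetchone()[0]

def rebuild_counters():
    # Offline repair job (for example from cron). This is the one place
    # where COUNT and a sub-select are acceptable, because no user waits on it.
    with db:
        db.execute(
            "UPDATE forums SET post_count ="
            " (SELECT COUNT(*) FROM posts WHERE posts.forum_id = forums.id)"
        )

add_post(1, "Hello")
add_post(1, "World")

# Simulate drift: a row disappears without the counter being touched.
with db:
    db.execute("DELETE FROM posts WHERE post_title = 'World'")

print(forum_post_count(1))  # stale: still 2
rebuild_counters()
print(forum_post_count(1))  # repaired: 1
```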
Another thing to note: split threads out of posts into their own table if you can. The smaller your tables are, the faster they'll be. Sifting through all posts to find the top-level posts of each thread is brutally slow, especially on popular systems.
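One way to do that split, sketched with Python's sqlite3 module as a stand-in: a small threads table that the forum listing reads, and a posts table that is only touched when someone opens a thread. The table layout, index names and helper functions here are assumptions for illustration, not a prescribed schema.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE threads (
    id INTEGER PRIMARY KEY,
    forum_id INTEGER NOT NULL,
    title TEXT NOT NULL,
    reply_count INTEGER NOT NULL DEFAULT 0
);
CREATE INDEX thread_order ON threads (forum_id, id);

CREATE TABLE posts (
    id INTEGER PRIMARY KEY,
    thread_id INTEGER NOT NULL,
    body TEXT NOT NULL
);
CREATE INDEX post_thread ON posts (thread_id, id);
""")

def start_thread(forum_id, title, body):
    with db:
        thread_id = db.execute(
            "INSERT INTO threads (forum_id, title) VALUES (?, ?)",
            (forum_id, title),
        ).lastrowid
        db.execute(
            "INSERT INTO posts (thread_id, body) VALUES (?, ?)", (thread_id, body)
        )
    return thread_id

def reply(thread_id, body):
    with db:
        db.execute(
            "INSERT INTO posts (thread_id, body) VALUES (?, ?)", (thread_id, body)
        )
        db.execute(
            "UPDATE threads SET reply_count = reply_count + 1 WHERE id = ?",
            (thread_id,),
        )

def list_threads(forum_id):
    # The forum index page never touches the (much larger) posts table.
    return db.execute(
        "SELECT id, title, reply_count FROM threads"
        " WHERE forum_id = ? ORDER BY id DESC",
        (forum_id,),
    ).fetchall()

t = start_thread(1, "Welcome", "First post")
reply(t, "Thanks!")
reply(t, "Hi")
print(list_threads(1))
```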
Also, cache anything you can get away with in something like Memcached if that's an option. For example, a user's friends listing won't change unless a friend is added or removed, so you don't need to select that list constantly from the database. The fastest database query is the one you never make, right?
To do this properly, you'll need to know the layout of each page and what information is going on it. Pages that aren't too popular need less optimization, but anything in the main line of fire will have to be carefully examined. Like a lot of things, there's probably an 80/20 rule going on, where 80% of your traffic hits only 20% of your code-base. That's where you'll want to be at your best.