Question

Alright, so I enjoy making forum software with PHP and MySQL, but one thing, and one thing only, has always troubled me:

The main page of the forums, where you view the list of forums. Each forum shows its name, the number of posts made in it, the number of discussions made in it, and its last poster. Therein lies the problem: getting all of that data when those things are stored in different tables. It's not a problem to GET the data; doing it EFFICIENTLY is what I'm after.

My current approach is this: store the current number of posts, the number of discussions, and the last poster statically in the forum table itself, instead of going out and grabbing the data from the different tables ("posts", "discussions", "forums", etc.). Then, when a user posts, it updates that "forums" table, incrementing the number of posts by 1 and updating the last poster, and also incrementing the number of discussions by 1 if they're starting a new discussion. This just seems inefficient and dirty to me for some reason, but maybe it's just me.
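To make that approach concrete, here is a minimal sketch of it. It uses Python's built-in sqlite3 module as a stand-in for PHP and MySQL so it can be run anywhere; the table layout, the column names, and the add_post helper are all assumptions made up for illustration, not anything from a real codebase. The one point it tries to show is that the insert and the counter update belong in the same transaction, so the cached numbers cannot get ahead of or behind the posts if one statement fails.

```python
import sqlite3

# Hypothetical minimal schema; table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE forums (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        post_count INTEGER NOT NULL DEFAULT 0,
        discussion_count INTEGER NOT NULL DEFAULT 0,
        last_poster TEXT
    );
    CREATE TABLE posts (
        id INTEGER PRIMARY KEY,
        forum_id INTEGER NOT NULL,
        username TEXT NOT NULL,
        body TEXT NOT NULL
    );
    INSERT INTO forums (id, name) VALUES (1, 'General');
""")


def add_post(conn, forum_id, username, body, new_discussion=False):
    """Insert a post and bump the forum's cached counters in one transaction."""
    with conn:  # commits on success, rolls back if either statement raises
        conn.execute(
            "INSERT INTO posts (forum_id, username, body) VALUES (?, ?, ?)",
            (forum_id, username, body),
        )
        conn.execute(
            """UPDATE forums
               SET post_count = post_count + 1,
                   discussion_count = discussion_count + ?,
                   last_poster = ?
               WHERE id = ?""",
            (1 if new_discussion else 0, username, forum_id),
        )


add_post(conn, 1, "alice", "First!", new_discussion=True)
add_post(conn, 1, "bob", "Welcome, alice.")

# The forum index page now needs one plain read, with no COUNT and no subquery.
forum_row = conn.execute(
    "SELECT name, post_count, discussion_count, last_poster FROM forums WHERE id = 1"
).fetchone()
print(forum_row)  # ('General', 2, 1, 'bob')
```

The same two statements should carry over to MySQL more or less unchanged; only the connection and transaction handling would differ.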

And here's another approach that I fear would be horribly inefficient: actually going out to each table ("posts", "discussions", "forums") and grabbing the data. The problem with this is that there can be hundreds of forums on one page, and I'd have to use a COUNT statement to fetch the number of posts or discussions, meaning I'd have to use subqueries, not to mention a third subquery to fetch the last poster. That being said, the query would be something like this pseudo-code-like thing:

SELECT foruminfo, (
    SELECT COUNT(id)
    FROM posts
    WHERE forumId = forums.id
) AS postCount, (
    SELECT COUNT(id)
    FROM discussions
    WHERE forumId = forums.id
) AS discussionCount, (
    SELECT postinfo
    FROM posts
    WHERE forumId = forums.id
    ORDER BY postdate DESC
    LIMIT 1
) AS lastPost
FROM forums
ORDER BY position DESC;

So basically those subqueries could be run hundreds of times if I have hundreds of forums being listed. And with hundreds of users viewing the page every second, would this not put quite a bit of strain on the database? I'm not entirely sure whether subqueries cause the same amount of load as normal queries, but if they do, then it seems like it would certainly be horribly inefficient.

Any ideas? :(


Solution

I've built large-scale forum systems before, and the key to making them performant is to de-normalize anything and everything you can.

You cannot realistically use JOIN on really popular pages. You must keep the number of queries you issue to the absolute minimum. You should never use sub-selects. Always be sure your indexes cover your exact use cases and no more. A query that takes longer than 1-5ms to run is probably far too slow for a site that's running at scale. When severe load suddenly makes everything take ten times longer to run, a 15ms query will take a crippling 150ms or more, while your optimized 1ms queries will take an acceptable 10ms. You're aiming for them to be 0.00s all the time, and it's possible to do this.

Remember that any time you're executing a query and waiting for a response, you're not able to do anything else. If you get a little careless, you'll have requests coming in faster than you can process them and the whole system will buckle.

Keep your schema simple, even stupidly simple, and by that I mean think about the layout of your page, the information you're showing, and make the schema match that as exactly as possible. Strip it down to the bare essentials. Represent it in a format that's as close as possible to the final output without making needless compromises.

If you're showing username, avatar, post title, number of posts, and date of posting, then those are the fields you have in your posts table. Yes, you will still have a separate users table, but transpose anything and everything you can into a straightforward structure that makes it as simple as this:

SELECT id, username, user_avatar, post_title, post_count, post_time FROM posts
  WHERE forum_id=?
  ORDER BY id DESC

Normally you'd have to join against users to get their name, maybe another table to get their particular avatar, and the discussions table to get the post count. You can avoid all that by changing your storage strategy.
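As a rough sketch of what that storage strategy could look like, the example below copies the author's display fields into each post row at write time, so the read query above needs no JOIN. It uses Python's sqlite3 module as a runnable stand-in for MySQL. The user_id column and the create_post and change_avatar helpers are assumptions added for illustration. The second helper shows the price of de-normalizing: when the source data changes, every copy has to be updated too.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        id INTEGER PRIMARY KEY,
        username TEXT NOT NULL,
        avatar TEXT NOT NULL
    );
    -- De-normalized: username and avatar are copied in, so reads need no JOIN.
    CREATE TABLE posts (
        id INTEGER PRIMARY KEY,
        forum_id INTEGER NOT NULL,
        user_id INTEGER NOT NULL,   -- assumed column, kept so copies can be refreshed
        username TEXT NOT NULL,
        user_avatar TEXT NOT NULL,
        post_title TEXT NOT NULL,
        post_count INTEGER NOT NULL DEFAULT 0,
        post_time TEXT NOT NULL
    );
    INSERT INTO users VALUES (1, 'alice', 'alice.png'), (2, 'bob', 'bob.png');
""")


def create_post(conn, forum_id, user_id, title, post_time):
    """Copy the author's display fields into the post row at write time."""
    with conn:
        conn.execute(
            """INSERT INTO posts
                   (forum_id, user_id, username, user_avatar, post_title, post_time)
               SELECT ?, id, username, avatar, ?, ?
               FROM users WHERE id = ?""",
            (forum_id, title, post_time, user_id),
        )


def change_avatar(conn, user_id, avatar):
    """The cost of de-normalizing: every stored copy must be refreshed as well."""
    with conn:
        conn.execute("UPDATE users SET avatar = ? WHERE id = ?", (avatar, user_id))
        conn.execute(
            "UPDATE posts SET user_avatar = ? WHERE user_id = ?", (avatar, user_id)
        )


create_post(conn, 1, 1, "Hello", "2024-01-01 10:00:00")
create_post(conn, 1, 2, "Re: Hello", "2024-01-01 10:05:00")
change_avatar(conn, 1, "alice-new.png")

# The read path is the single-table query from the answer.
rows = conn.execute(
    """SELECT id, username, user_avatar, post_title, post_count, post_time
       FROM posts WHERE forum_id = ? ORDER BY id DESC""",
    (1,),
).fetchall()
for row in rows:
    print(row)
```

Writes get a little more expensive and reads get much cheaper, which is usually the right trade on a page that is read far more often than it is written.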

In the case I was working with, it was a requirement to be able to post things in the future as well as in the past, so I had to create a specific "sort key" independent of ID, like your position. If this is not the case for you, just use the id primary key for ordering, something like this:

INDEX post_order (forum_id, id)
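If you want to check that an index like this really serves both the WHERE and the ORDER BY, you can ask the database for its query plan. The sketch below does that with Python's sqlite3 module and SQLite's EXPLAIN QUERY PLAN, purely as a runnable stand-in; in MySQL you would declare the index inside CREATE TABLE (or with ALTER TABLE) and use EXPLAIN, and the output looks different. The trimmed-down posts table and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (
        id INTEGER PRIMARY KEY,
        forum_id INTEGER NOT NULL,
        post_title TEXT NOT NULL
    );
    CREATE INDEX post_order ON posts (forum_id, id);
""")
with conn:
    conn.executemany(
        "INSERT INTO posts (forum_id, post_title) VALUES (?, ?)",
        [(i % 3, f"post {i}") for i in range(30)],
    )

query = "SELECT id, post_title FROM posts WHERE forum_id = ? ORDER BY id DESC LIMIT 5"

# Each plan row ends with a human-readable detail string; the exact wording
# varies between SQLite versions, so only look for the index name in it.
plan = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query, (1,))]
print(plan)

# Newest five posts in forum 1, read straight off the index in reverse order.
newest_ids = [row[0] for row in conn.execute(query, (1,))]
print(newest_ids)  # [29, 26, 23, 20, 17]
```

What you are hoping to see is the index name in the plan and no separate sorting step. If a sort step shows up, the index columns probably do not match the query.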

Using SUM or COUNT is completely out of the question. You need counter-cache columns. These are columns that save counts, such as how many messages are in a particular forum. Yes, they will drift out of sync once in a while, like any de-normalized data, so you will need to add tools to keep them in check and to rebuild them completely if required. Usually you can do this as a cron job that runs once daily to repair any minor corruption that might have occurred. Most of the time, if you get your implementation correct, they will be perfectly in sync.
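A repair job of the kind described above might look roughly like this. It is only a sketch under assumptions: the forums and posts tables, the post_count column, and the repair_post_counts function are made-up names, and Python's sqlite3 module stands in for MySQL so the example is runnable. Note that COUNT does appear here, but only in the offline job, never on the page that users hit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE forums (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        post_count INTEGER NOT NULL DEFAULT 0
    );
    CREATE TABLE posts (
        id INTEGER PRIMARY KEY,
        forum_id INTEGER NOT NULL
    );
    -- Forum 2 claims 7 posts but only has 2: its counter has drifted.
    INSERT INTO forums (id, name, post_count) VALUES (1, 'General', 3), (2, 'Help', 7);
    INSERT INTO posts (forum_id) VALUES (1), (1), (1), (2), (2);
""")


def repair_post_counts(conn):
    """Recount posts per forum and overwrite the cached counters.

    Returns the ids of the forums whose counter was wrong, for logging.
    """
    with conn:
        drifted = [
            row[0]
            for row in conn.execute(
                """SELECT f.id
                   FROM forums f
                   WHERE f.post_count <>
                         (SELECT COUNT(*) FROM posts p WHERE p.forum_id = f.id)
                   ORDER BY f.id"""
            )
        ]
        conn.execute(
            """UPDATE forums
               SET post_count =
                   (SELECT COUNT(*) FROM posts WHERE posts.forum_id = forums.id)"""
        )
    return drifted


fixed = repair_post_counts(conn)
counts = conn.execute("SELECT id, post_count FROM forums ORDER BY id").fetchall()
print(fixed)   # [2]
print(counts)  # [(1, 3), (2, 2)]
```

Logging the list of drifted forums is worth doing: if the same counters keep drifting, that points at a write path that forgets to update them.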

Another thing to note: split threads out from posts if you can, so that the thread list lives in its own table. The smaller your tables are, the faster they'll be. Sifting through all posts to find the top-level posts of each thread is brutally slow, especially on popular systems.
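One way to read that advice is as a separate, small threads table that the listing page reads, with the bulky posts table only touched when someone opens a thread. The sketch below shows that layout with Python's sqlite3 module standing in for MySQL. The schema, the reply_count column, and the start_thread and reply helpers are assumptions for illustration, not the layout of any particular forum package.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One small row per thread: everything the thread listing page shows.
    CREATE TABLE threads (
        id INTEGER PRIMARY KEY,
        forum_id INTEGER NOT NULL,
        title TEXT NOT NULL,
        reply_count INTEGER NOT NULL DEFAULT 0
    );
    CREATE INDEX thread_order ON threads (forum_id, id);
    -- The bulk of the data lives here and is only read inside a thread.
    CREATE TABLE posts (
        id INTEGER PRIMARY KEY,
        thread_id INTEGER NOT NULL,
        body TEXT NOT NULL
    );
""")


def start_thread(conn, forum_id, title, body):
    """Create the thread row and its first post together."""
    with conn:
        cur = conn.execute(
            "INSERT INTO threads (forum_id, title) VALUES (?, ?)", (forum_id, title)
        )
        thread_id = cur.lastrowid
        conn.execute(
            "INSERT INTO posts (thread_id, body) VALUES (?, ?)", (thread_id, body)
        )
    return thread_id


def reply(conn, thread_id, body):
    """Add a post and bump the thread's cached reply counter."""
    with conn:
        conn.execute(
            "INSERT INTO posts (thread_id, body) VALUES (?, ?)", (thread_id, body)
        )
        conn.execute(
            "UPDATE threads SET reply_count = reply_count + 1 WHERE id = ?",
            (thread_id,),
        )


t1 = start_thread(conn, 1, "Hello", "First post")
reply(conn, t1, "Welcome")
reply(conn, t1, "Hi there")
t2 = start_thread(conn, 1, "Rules", "Be nice")

# The listing never touches posts at all.
thread_list = conn.execute(
    "SELECT id, title, reply_count FROM threads WHERE forum_id = ? ORDER BY id DESC",
    (1,),
).fetchall()
print(thread_list)  # [(2, 'Rules', 0), (1, 'Hello', 2)]
```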

Also, cache anything you can get away with in something like Memcached if that's an option. For example, a user's friends listing won't change unless a friend is added or removed, so you don't need to select that list constantly from the database. The fastest database query is the one you never make, right?
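The friends-list example could be sketched like this. To keep it self-contained, a plain Python dict stands in for a Memcached client and sqlite3 stands in for MySQL; the friends table, the key format, and the get_friends and add_friend helpers are all assumptions for illustration. The pattern is what matters: read through the cache, and throw the cached entry away whenever the underlying list changes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE friends (user_id INTEGER NOT NULL, friend_id INTEGER NOT NULL);
    INSERT INTO friends VALUES (1, 2), (1, 3);
""")

cache = {}    # stand-in for a Memcached client; a plain dict for illustration
db_reads = 0  # counts how often the database is actually queried


def get_friends(user_id):
    """Return the friend ids, hitting the database only on a cache miss."""
    global db_reads
    key = f"friends:{user_id}"
    if key in cache:
        return cache[key]
    db_reads += 1
    friends = [
        row[0]
        for row in conn.execute(
            "SELECT friend_id FROM friends WHERE user_id = ? ORDER BY friend_id",
            (user_id,),
        )
    ]
    cache[key] = friends
    return friends


def add_friend(user_id, friend_id):
    """Write to the database, then invalidate the stale cache entry."""
    with conn:
        conn.execute("INSERT INTO friends VALUES (?, ?)", (user_id, friend_id))
    cache.pop(f"friends:{user_id}", None)


first = get_friends(1)   # cache miss: one database read
second = get_friends(1)  # cache hit: no database read
add_friend(1, 4)         # invalidates the cached list
third = get_friends(1)   # miss again, picks up the new friend

print(first, third, db_reads)  # [2, 3] [2, 3, 4] 2
```

With a real cache server you would normally also set an expiry time on each entry, so that a missed invalidation heals itself eventually.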

To do this properly, you'll need to know the layout of each page and what information is going on it. Pages that aren't too popular need less optimization, but anything in the main line of fire will have to be carefully examined. Like a lot of things, there's probably an 80/20 rule going on, where 80% of your traffic hits only 20% of your code-base. That's where you'll want to be at your best.

Licensed under: CC-BY-SA with attribution