A score combines a movie, a user and a score value, right? And of course the time at which the score was given. I would definitely store web users and critics in one table. If you really expect a large number of entries in those tables, you could duplicate the "critic" flag in the score table, so the select needs no join against users. That would also reflect the fact that a critic who has since "retired" still counted twice as much at the time of the rating. So:
Table users (user_id, is_critic tinyint, name...);
Table score (user_id, movie_id, score, is_critic tinyint, scoretime...);
The select would then simply be sum((1+is_critic) * score) / sum(1+is_critic), when you make 1 = critic, 0 = webuser.
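
A minimal sketch of the two tables and the weighted select in MySQL syntax. The column types, the primary keys and the movie id 42 are assumptions for illustration, not something the question specified.

```sql
-- Sketch only: column types and sizes are assumptions.
CREATE TABLE users (
    user_id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    is_critic TINYINT NOT NULL DEFAULT 0,   -- 1 = critic, 0 = web user
    name      VARCHAR(100) NOT NULL
);

CREATE TABLE score (
    user_id   INT UNSIGNED NOT NULL,
    movie_id  INT UNSIGNED NOT NULL,
    score     TINYINT NOT NULL,
    is_critic TINYINT NOT NULL,             -- copied from users at rating time
    scoretime DATETIME NOT NULL,
    PRIMARY KEY (user_id, movie_id)         -- assumes one rating per user and movie
);

-- Weighted average for one movie: a critic's score counts twice.
SELECT SUM((1 + is_critic) * score) / SUM(1 + is_critic) AS weighted_avg
FROM score
WHERE movie_id = 42;                        -- 42 is a placeholder id
```

For example, if one critic gives 8 and two web users give 5 and 6, the result is (2*8 + 5 + 6) / (2 + 1 + 1) = 6.75.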
If you want to store the average rating, then store it not as the quotient (as in the example I just gave), but as its two parts: sum(weighted score) AND sum(weighted number). I guess you will need a time scale sooner or later (score goes up or down, number of votes...), so create a table with time intervals (say, weeks?) and connect your pre-averaging table to that. Then you can easily sum those ratings for a movie over any range of intervals. Ask in a comment if that is too compact.
The individual data is in the single user ratings, so for one movie you can select all users who rated that movie, and from there all the other movies those users rated, together with a count. That might get slow with a large number of ratings. I will think for a minute about a good aggregation of that, but I am quite sure it will involve the weeks table too. I do not know whether attention for typical cinema movies is counted in days or weeks, or whether you work with kinds of movies where attention lasts for months, years or longer. But even if it is 30 years, that is only about 1,560 weeks, so nothing big for MySQL.
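
The idea of keeping the two sums per week could look roughly like this. The table names weeks and movie_week_score, their columns and the ids in the last query are hypothetical; the score table is the one listed further up, with scoretime assumed to be a DATETIME.

```sql
-- Hypothetical names: weeks, movie_week_score and their columns are illustrative.
CREATE TABLE weeks (
    week_id    INT UNSIGNED NOT NULL PRIMARY KEY,
    week_start DATE NOT NULL
);

CREATE TABLE movie_week_score (
    movie_id           INT UNSIGNED NOT NULL,
    week_id            INT UNSIGNED NOT NULL,
    weighted_score_sum INT UNSIGNED NOT NULL,  -- SUM((1 + is_critic) * score)
    weighted_count     INT UNSIGNED NOT NULL,  -- SUM(1 + is_critic)
    PRIMARY KEY (movie_id, week_id)
);

-- Fill the pre-aggregated table from the raw scores.
INSERT INTO movie_week_score (movie_id, week_id, weighted_score_sum, weighted_count)
SELECT s.movie_id, w.week_id,
       SUM((1 + s.is_critic) * s.score),
       SUM(1 + s.is_critic)
FROM score AS s
JOIN weeks AS w
  ON s.scoretime >= w.week_start
 AND s.scoretime <  w.week_start + INTERVAL 7 DAY
GROUP BY s.movie_id, w.week_id;

-- Average over a range of weeks: divide the summed parts.
SELECT SUM(weighted_score_sum) / SUM(weighted_count) AS weighted_avg
FROM movie_week_score
WHERE movie_id = 42 AND week_id BETWEEN 100 AND 110;  -- placeholder ids
```

Storing the two sums instead of the quotient is what makes the last query correct: averages of weekly averages would give quiet weeks the same influence as busy ones.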
A question arises here: is the time between two scores important for the relation of the scored movies? Someone rates 'ICE AGE' as a great movie at the age of 13, but 2 years later he enjoys 'Pulp Fiction'. I am not really sure that this connects those two movies in the sense you mean.
As soon as you can define that relationship, you should define a limit for how many users must have "connected" two movies (in a certain time interval) for the "connection" to be relevant. In principle this produces a table with (number of movies) x (number of movies) [ x time?? ] entries, which could get large. Since the relationship is symmetric, you either need rather bad queries with an or clause (bad for index usage and timing), or you should store both directions (x is related to y with weight 0.1, so y is related to x with weight 0.1). That's why I would keep two kinds of thresholds:
Store that relation only if there are more than (very tricky number here) users who rated both good or both bad (the tricky number should depend on the overall rates of the web site, and of the overall rates of both movies)
Store only the 20 hottest relations per movie.
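
A rough sketch of the "store both directions" variant and the top-20 lookup. The table movie_relation, its weight column, the index and the ids are hypothetical.

```sql
-- Hypothetical table; names and types are assumptions.
CREATE TABLE movie_relation (
    movie_x_id INT UNSIGNED NOT NULL,
    movie_y_id INT UNSIGNED NOT NULL,
    weight     DECIMAL(5,4) NOT NULL,
    PRIMARY KEY (movie_x_id, movie_y_id),
    KEY idx_x_weight (movie_x_id, weight)
);

-- Storing each pair only once would force an OR across two columns:
--   SELECT ... WHERE movie_x_id = 42 OR movie_y_id = 42;

-- Storing both directions keeps every lookup on movie_x_id.
INSERT INTO movie_relation (movie_x_id, movie_y_id, weight)
VALUES (1, 2, 0.1),
       (2, 1, 0.1);

-- The 20 strongest relations of one movie.
SELECT movie_y_id, weight
FROM movie_relation
WHERE movie_x_id = 42
ORDER BY weight DESC
LIMIT 20;
```

The price is twice the rows, which the two thresholds are meant to keep in check.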
So there are still a few parts left for you to have fun and headaches with. Part 3 especially will grow into a more or less sophisticated Artificial Intelligence of rules and of "I did not mean it THAT way", so for part 3 be prepared to store data in a different technology than MySQL. But the raw data is fine in MySQL, at least for the first few million ratings. All in all it does not take much memory, so the whole rating system should fit into a reasonably sized RAM for quite some time.
So, my aggregate table would have the fields:
movie_x_id movie_y_id ratings_until users_connecting users_connecting_same users_connecting_anti
I think a user can rate a movie at most once, so no complicated number math is involved. users_connecting is the overall number of users who rated both movies (in a certain time?), _same would give the number of users who rated both in more or less the same direction (both good, both bad, both medium), and _anti is the number of users who found one movie great and the other bad.
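
One possible way to fill such a table from the raw scores with a self-join. The table name movie_relation_agg, the 1..10 scale, the cut-offs for good (>= 7) and bad (<= 4), and reading ratings_until as "ratings up to this moment are included" are all assumptions of this sketch.

```sql
-- Assumptions: 1..10 scale, >= 7 is good, <= 4 is bad, 5..6 is medium.
CREATE TABLE movie_relation_agg (
    movie_x_id            INT UNSIGNED NOT NULL,
    movie_y_id            INT UNSIGNED NOT NULL,
    ratings_until         DATETIME NOT NULL,
    users_connecting      INT UNSIGNED NOT NULL,
    users_connecting_same INT UNSIGNED NOT NULL,
    users_connecting_anti INT UNSIGNED NOT NULL,
    PRIMARY KEY (movie_x_id, movie_y_id, ratings_until)
);

INSERT INTO movie_relation_agg
    (movie_x_id, movie_y_id, ratings_until,
     users_connecting, users_connecting_same, users_connecting_anti)
SELECT a.movie_id, b.movie_id, NOW(),
       COUNT(*),
       -- comparisons evaluate to 1 or 0 in MySQL, so SUM counts the matches
       SUM(   (a.score >= 7 AND b.score >= 7)
           OR (a.score <= 4 AND b.score <= 4)
           OR ((a.score BETWEEN 5 AND 6) AND (b.score BETWEEN 5 AND 6))),
       SUM(   (a.score >= 7 AND b.score <= 4)
           OR (a.score <= 4 AND b.score >= 7))
FROM score AS a
JOIN score AS b
  ON b.user_id = a.user_id
 AND b.movie_id <> a.movie_id      -- produces both (x, y) and (y, x)
GROUP BY a.movie_id, b.movie_id;
```

Users who fall into neither bucket (say, one good and one medium) only count towards users_connecting. On a large score table this join is the slow part, so it is a candidate for running per time interval rather than over everything.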
(Hint: be careful to store the score in a robust way. You might start with a 1..10 system and switch later to 1..5, which would make all the so-so movies look bad. You could define an internal representation of the score into which every user-given score is converted.)
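
As one possible internal representation, every score could be mapped linearly onto 0..100 before it is stored. The 0..100 range and the linear mapping are assumptions of this sketch; any fixed internal scale would do.

```sql
-- internal = (score - scale_min) / (scale_max - scale_min) * 100
SELECT ROUND((7 - 1) / (10 - 1) * 100) AS from_ten_scale,   -- a 7 on the 1..10 scale
       ROUND((4 - 1) / (5 - 1) * 100)  AS from_five_scale;  -- a 4 on the 1..5 scale
```

A 7 out of 10 lands at about 67 and a 4 out of 5 at 75, so old and new ratings stay comparable after the switch.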
If there are still questions, just ask in the comments.