Question

I currently have 3 main problems with my ERD. The ERD is for an online movie DB similar to IMDB.

  1. (bottom image) Is it correct to model these 2 entities separately from the Website User entity, given that a critic's score is worth 2x the same score from a regular user? Or should I list them as an attribute of Website User, since this would eliminate the doubling up of userIDs? Will this make any difference when I need to calculate the 'final' scores of each user type later down the track?

  2. (2nd image, arrow) The user can give a rating for a particular movie. These ratings are averaged into an 'average movie rating'. Where do I attach the relationship that lists the user's rating, and how would I then relate the calculation over each user's rating to arrive at a final average rating?

  3. (2nd and top image, circles) The user 'likes' one or more movies. In the movie table, there is a relation so that 'related movies' can be listed. 'Related' is determined in 2 ways:

a) Same Genre

b) Users who liked this also liked...

Where do I attach this relation, as I already have a 'likes' from the user to the movie (to be displayed on the user's profile etc.)? Do I change the initial likes to a ternary relationship with the other relation going to the 'users also liked', or do I have to make a new relationship directly between the user and the 'users also liked' entity?

Pictures: https://i.stack.imgur.com/zRsXR.jpg

I'm getting pretty confused at the moment, so any input would be appreciated.

Cheers


Solution

  1. A score combines a movie, a user and a given score, right? And of course the time at which the score was given. I would definitely store web users and critics in one table. If you really expect a great number of entries in those tables, you could duplicate the "critic" flag in the score table. That would also reflect the fact that a critic who has since "retired" still counted 2x at the time the score was given. So:

    Table users (user_id, is_critic tinyint, name...);

    Table score (user_id, movie_id, score, is_critic tinyint, scoretime...);

The select would then simply be sum((1+is_critic) * score) / sum(1+is_critic), with is_critic = 1 for a critic and 0 for a web user.
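To make the weighted select concrete, here is a minimal sketch that runs it against an in-memory SQLite database from Python. SQLite is only a stand-in for MySQL here, and the column types, the primary key on (user_id, movie_id) and the sample rows are my assumptions, not part of the design above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (
    user_id   INTEGER PRIMARY KEY,
    is_critic INTEGER NOT NULL,      -- 1 = critic, 0 = web user
    name      TEXT
);
CREATE TABLE score (
    user_id   INTEGER NOT NULL REFERENCES users(user_id),
    movie_id  INTEGER NOT NULL,
    score     INTEGER NOT NULL,
    is_critic INTEGER NOT NULL,      -- copied from users at rating time
    scoretime TEXT NOT NULL,
    PRIMARY KEY (user_id, movie_id)  -- assumption: one rating per user and movie
);
""")

# Hypothetical sample data: two web users and one critic rate movie 10.
conn.executemany("INSERT INTO users VALUES (?, ?, ?)",
                 [(1, 0, "alice"), (2, 0, "bob"), (3, 1, "carol")])
conn.executemany("INSERT INTO score VALUES (?, ?, ?, ?, ?)",
                 [(1, 10, 6, 0, "2024-01-01"),
                  (2, 10, 8, 0, "2024-01-02"),
                  (3, 10, 9, 1, "2024-01-03")])

def weighted_average(conn, movie_id):
    """Average score where a critic's rating counts twice; None if unrated."""
    row = conn.execute(
        "SELECT SUM((1 + is_critic) * score) * 1.0 / SUM(1 + is_critic) "
        "FROM score WHERE movie_id = ?",
        (movie_id,),
    ).fetchone()
    return row[0]

avg = weighted_average(conn, 10)   # (6 + 8 + 2*9) / (1 + 1 + 2)
print(avg)
```

The `* 1.0` forces floating-point division in SQLite; in MySQL the `/` operator already returns a non-integer result.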

  2. If you want to store the average rating, store it not as the quotient (as in the example I just gave) but as its two parts: sum(weighted score) AND sum(weights). I guess you will want a time scale sooner or later (score goes up or down, number of votes...), so create a table with time intervals (say, weeks?) and connect your pre-averaging table to that. Then you can easily sum over those intervals for a movie. Ask in a comment if that is too compact.
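A possible shape for such a pre-averaging table, sketched with Python and in-memory SQLite (standing in for MySQL). The table name `score_weekly`, its column names and the use of the week's start date as the interval key are hypothetical choices for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE score_weekly (
    movie_id           INTEGER NOT NULL,
    week_start         TEXT    NOT NULL,   -- ISO date of the week's first day
    weighted_score_sum INTEGER NOT NULL,   -- sum of (1 + is_critic) * score
    weight_sum         INTEGER NOT NULL,   -- sum of (1 + is_critic)
    PRIMARY KEY (movie_id, week_start)
)
""")

def add_rating(conn, movie_id, week_start, score, is_critic):
    """Fold one new rating into the running sums of its week."""
    weight = 1 + is_critic
    cur = conn.execute(
        "UPDATE score_weekly "
        "SET weighted_score_sum = weighted_score_sum + ?, "
        "    weight_sum = weight_sum + ? "
        "WHERE movie_id = ? AND week_start = ?",
        (weight * score, weight, movie_id, week_start),
    )
    if cur.rowcount == 0:  # no row for this week yet
        conn.execute("INSERT INTO score_weekly VALUES (?, ?, ?, ?)",
                     (movie_id, week_start, weight * score, weight))

def average_between(conn, movie_id, first_week, last_week):
    """Weighted average over a range of weeks; None if there are no ratings."""
    row = conn.execute(
        "SELECT SUM(weighted_score_sum) * 1.0 / SUM(weight_sum) "
        "FROM score_weekly "
        "WHERE movie_id = ? AND week_start BETWEEN ? AND ?",
        (movie_id, first_week, last_week),
    ).fetchone()
    return row[0]

add_rating(conn, 10, "2024-01-01", 6, 0)   # web user
add_rating(conn, 10, "2024-01-01", 9, 1)   # critic, counts twice
add_rating(conn, 10, "2024-01-08", 4, 0)   # web user, one week later

overall = average_between(conn, 10, "2024-01-01", "2024-01-08")   # 28 / 4
first_week = average_between(conn, 10, "2024-01-01", "2024-01-01")  # 24 / 3
print(overall, first_week)
```

Keeping the two sums separate is what makes the intervals combinable: averaging the two weekly averages here would give (8 + 4) / 2 = 6, whereas the sums give the correct 28 / 4 = 7.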

  3. The raw data is in the individual user ratings, so for one movie you can select all users who rated that movie, and from there all the other movies those users rated, together with a count. That might get slow with a big number of ratings. I will think for a minute about a good aggregation of that, but I am quite sure it will involve the weeks table too. I don't know whether attention for typical cinema movies is measured in days or weeks, or whether you deal with kinds of movies where attention lasts for months, years or longer. But even if it is 30 years, that's just about 1500 weeks, so nothing large for MySQL.
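The "users who rated this movie also rated..." lookup can be written as a self-join on the ratings table. The following is a rough sketch using Python and in-memory SQLite in place of MySQL; the reduced `score` table and the sample rows are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE score (
    user_id  INTEGER NOT NULL,
    movie_id INTEGER NOT NULL,
    score    INTEGER NOT NULL,
    PRIMARY KEY (user_id, movie_id)
)
""")
conn.executemany("INSERT INTO score VALUES (?, ?, ?)", [
    (1, 10, 8), (1, 20, 7), (1, 30, 9),   # user 1 rated movies 10, 20, 30
    (2, 10, 6), (2, 20, 8),               # user 2 rated movies 10, 20
    (3, 10, 9),                           # user 3 rated only movie 10
    (4, 20, 5), (4, 40, 7),               # user 4 never rated movie 10
])

def also_rated(conn, movie_id):
    """Other movies rated by the users who rated movie_id, with user counts."""
    return conn.execute(
        """
        SELECT other.movie_id, COUNT(*) AS shared_users
        FROM score AS mine
        JOIN score AS other
          ON other.user_id = mine.user_id
         AND other.movie_id <> mine.movie_id
        WHERE mine.movie_id = ?
        GROUP BY other.movie_id
        ORDER BY shared_users DESC, other.movie_id
        """,
        (movie_id,),
    ).fetchall()

related = also_rated(conn, 10)
print(related)
```

Movie 40 does not show up for movie 10 because its only rater never rated movie 10. On a large ratings table this join is the part that gets slow, which is the motivation for a precomputed aggregate.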

A question arises there: is the time between two scores important for the relation of the scored movies? Someone scores 'ICE AGE' as a great movie at the age of 13, but 2 years later he enjoys 'Pulp Fiction'. I am not really sure if that connects those two movies in the sense you mean.

As soon as you can define the relationship between them, you should define a limit on how many users must have "connected" those movies (in a certain time interval) for the "connection" to be relevant. In principle this gives a table with (number of movies) x (number of movies) [ x time?? ] entries, which could become a big number. Since the relationship is symmetric, you either need rather bad queries with an OR clause (bad for index usage and timing), or you store both directions (x is related to y with weight 0.1, so y is related to x with weight 0.1). That's why I would keep two kinds of thresholds:

  • Store that relation only if more than (very tricky number here) users rated both movies good or both bad (the tricky number should depend on the overall rates of the web site and on the overall rates of both movies)

  • Store only the 20 hottest relations per movie.
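The two thresholds above could be applied as a pruning step before the relations are stored. This is only a sketch in plain Python: the function name `prune_relations`, the tuple layout and the use of a simple user count as the "hotness" measure are my assumptions, and the value of the tricky number is left to the caller.

```python
from collections import defaultdict

def prune_relations(relations, threshold, top_n=20):
    """Keep only relations worth storing.

    relations: iterable of (movie_x, movie_y, users) tuples that already
               contains both directions of every pair.
    threshold: a relation is kept only if MORE than this many users connect
               the two movies (the "tricky number").
    top_n:     at most this many relations are kept per movie.
    """
    per_movie = defaultdict(list)
    for movie_x, movie_y, users in relations:
        if users > threshold:
            per_movie[movie_x].append((movie_y, users))
    # Hottest first; ties are broken by movie id to keep the result stable.
    return {
        movie: sorted(related, key=lambda r: (-r[1], r[0]))[:top_n]
        for movie, related in per_movie.items()
    }

# Hypothetical input, both directions stored for each pair.
relations = [
    (1, 2, 5), (2, 1, 5),
    (1, 3, 2), (3, 1, 2),
    (1, 4, 9), (4, 1, 9),
]
kept = prune_relations(relations, threshold=2, top_n=1)
print(kept)
```

With `top_n=1` the result is no longer symmetric: movie 2 still points to movie 1, but movie 1 only keeps movie 4. That is a consequence of limiting per movie and is worth keeping in mind when displaying "related movies".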

So there are still a few parts left that will give you fun and headaches, and especially part 3 will grow into a more or less sophisticated Artificial Intelligence of rules and of "I did not mean it THAT way", so for part 3 be prepared to store data in a different technology than MySQL. But the raw data is fine in MySQL, at least for the first few million ratings. All in all that does not take much memory, so the whole rating system should fit in a reasonably sized RAM for quite some time.

So, my aggregate table would have the fields:

movie_x_id    movie_y_id   ratings_until  users_connecting  users_connecting_same  users_connecting_anti 

I assume a user can rate a movie at most once, so no complicated arithmetic is involved. users_connecting is the overall number of users who rated both movies (in a certain time?), _same is the number of users who rated both in more or less the same direction (both good, both bad, both medium), and _anti is the number of users who found one movie great and the other bad.
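One way such an aggregate table might be filled from the raw ratings, sketched with Python and in-memory SQLite rather than MySQL. Several things here are assumptions of mine: the table name `movie_relation`, reading `ratings_until` as a cutoff timestamp, and the bucket limits (4 or less is "bad", 8 or more is "good" on a 1..10 scale).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE score (
    user_id   INTEGER NOT NULL,
    movie_id  INTEGER NOT NULL,
    score     INTEGER NOT NULL,       -- assumed 1..10 scale
    scoretime TEXT    NOT NULL,
    PRIMARY KEY (user_id, movie_id)
);
CREATE TABLE movie_relation (
    movie_x_id            INTEGER NOT NULL,
    movie_y_id            INTEGER NOT NULL,
    ratings_until         TEXT    NOT NULL,  -- assumption: cutoff of included ratings
    users_connecting      INTEGER NOT NULL,
    users_connecting_same INTEGER NOT NULL,
    users_connecting_anti INTEGER NOT NULL,
    PRIMARY KEY (movie_x_id, movie_y_id)
);
""")
conn.executemany("INSERT INTO score VALUES (?, ?, ?, ?)", [
    (1, 10, 9, "2024-01-01"), (1, 20, 8, "2024-01-02"),  # good / good
    (2, 10, 9, "2024-01-01"), (2, 20, 2, "2024-01-03"),  # good / bad
    (3, 10, 5, "2024-01-04"), (3, 20, 6, "2024-01-04"),  # medium / medium
    (4, 10, 3, "2024-01-05"), (4, 20, 5, "2024-01-06"),  # bad / medium
    (5, 10, 9, "2024-01-05"), (5, 20, 9, "2025-02-01"),  # second one is after the cutoff
])

# Bucket each score as -1 (bad), 0 (medium) or 1 (good), then pair up each
# user's ratings. The join yields both directions (x, y) and (y, x).
BUILD_RELATIONS = """
WITH bucketed AS (
    SELECT user_id, movie_id,
           CASE WHEN score <= 4 THEN -1
                WHEN score >= 8 THEN 1
                ELSE 0 END AS bucket
    FROM score
    WHERE scoretime <= :cutoff
)
INSERT INTO movie_relation
SELECT x.movie_id, y.movie_id, :cutoff,
       COUNT(*),
       SUM(CASE WHEN x.bucket = y.bucket THEN 1 ELSE 0 END),
       SUM(CASE WHEN x.bucket * y.bucket = -1 THEN 1 ELSE 0 END)
FROM bucketed AS x
JOIN bucketed AS y
  ON y.user_id = x.user_id AND y.movie_id <> x.movie_id
GROUP BY x.movie_id, y.movie_id
"""
conn.execute(BUILD_RELATIONS, {"cutoff": "2024-12-31"})

rows = conn.execute(
    "SELECT movie_x_id, movie_y_id, users_connecting, "
    "       users_connecting_same, users_connecting_anti "
    "FROM movie_relation ORDER BY movie_x_id, movie_y_id"
).fetchall()
print(rows)
```

User 4 (bad / medium) counts as connecting but neither as "same" nor as "anti", so the two detail columns do not have to add up to users_connecting.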

(Hint: be careful to store the score in a robust way. You might start with a 1..10 scale and later switch to 1..5, which would make all so-so movies look bad. You could define an internal representation of the score into which every user-given score is converted.)
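As an illustration of such an internal representation, the sketch below maps any 1..N scale linearly onto the range 0.0 to 1.0. The linear mapping and the helper names `to_internal` and `from_internal` are my own choices, not something the hint prescribes.

```python
def to_internal(score, scale_max):
    """Map a user-given score on a 1..scale_max scale to 0.0..1.0."""
    if not 1 <= score <= scale_max:
        raise ValueError("score outside of scale")
    return (score - 1) / (scale_max - 1)

def from_internal(value, scale_max):
    """Map an internal 0.0..1.0 value back to the nearest step on 1..scale_max."""
    return round(1 + value * (scale_max - 1))

old_best = to_internal(10, 10)    # top score on the old 1..10 scale
new_middle = to_internal(3, 5)    # middle score on the new 1..5 scale
shown = from_internal(old_best, 5)
print(old_best, new_middle, shown)
```

Stored this way, a 10 given on the old scale is still displayed as the top score after the switch to 1..5, instead of being misread as an out-of-range or rescaled value.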

If there are still questions, just ask in the comments.

Licensed under: CC-BY-SA with attribution