Question

For a school project, we'll have to implement a ranking system. However, we figured that a dumb rank average would suck: something that one user ranked 5 stars would have a better average than something 188 users ranked 4 stars, and that's just stupid.

So I'm wondering if any of you have an example algorithm of "smart" ranking. It only needs to take into account the ratings given and the number of ratings.

Thanks!

Solution

You can use a method inspired by Bayesian probability. The gist of the approach is to have an initial belief about the true rating of an item, and use users' ratings to update your belief.

This approach requires two parameters:

  1. What do you think is the true "default" rating of an item, if you have no ratings at all for the item? Call this number R, the "initial belief".
  2. How much weight do you give to the initial belief, compared to the user ratings? Call this W, where the initial belief is "worth" W user ratings of that value.

With the parameters R and W, computing the new rating is simple: assume you have W ratings of value R along with any user ratings, and compute the average. For example, if R = 2 and W = 3, we compute the final score for various scenarios below:

  • 100 (user) ratings of 4: (3*2 + 100*4) / (3 + 100) = 3.94
  • 3 ratings of 5 and 1 rating of 4: (3*2 + 3*5 + 1*4) / (3 + 3 + 1) = 3.57
  • 10 ratings of 4: (3*2 + 10*4) / (3 + 10) = 3.54
  • 1 rating of 5: (3*2 + 1*5) / (3 + 1) = 2.75
  • No user ratings: (3*2 + 0) / (3 + 0) = 2
  • 1 rating of 1: (3*2 + 1*1) / (3 + 1) = 1.75

This computation takes into consideration the number of user ratings, and the values of those ratings. As a result, the final score roughly corresponds to how happy one can expect to be about a particular item, given the data.
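
To make this concrete, here is a minimal sketch in Python (the language and the function name adjusted_rating are my own assumptions, just for illustration) that reproduces the scenarios above:

def adjusted_rating(ratings, R=2.0, W=3.0):
    # Pretend there are W extra ratings of value R, then take the plain average.
    return (W * R + sum(ratings)) / (W + len(ratings))

print(round(adjusted_rating([4] * 100), 2))     # 3.94
print(round(adjusted_rating([5, 5, 5, 4]), 2))  # 3.57
print(round(adjusted_rating([4] * 10), 2))      # 3.54
print(round(adjusted_rating([5]), 2))           # 2.75
print(round(adjusted_rating([]), 2))            # 2.0
print(round(adjusted_rating([1]), 2))           # 1.75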

Choosing R

When you choose R, think about what value you would be comfortable assuming for an item with no ratings. Is the typical no-rating item actually 2.4 out of 5, if you were to instantly have everyone rate it? If so, R = 2.4 would be a reasonable choice.

You should not use the minimum value on the rating scale for this parameter, since an item rated extremely poorly by users should end up "worse" than a default item with no ratings.

If you want to pick R using data rather than just intuition, you can use the following method:

  • Consider all items with at least some threshold of user ratings (so you can be confident that the average user rating is reasonably accurate).
  • For each item, assume its "true score" is the average user rating.
  • Choose R to be the median of those scores.

If you want to be slightly more optimistic or pessimistic about a no-rating item, you can choose R to be a different percentile of the scores, for instance the 60th percentile (optimistic) or 40th percentile (pessimistic).
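
One possible sketch of that procedure in Python (assuming you have a dict mapping each item to its list of user ratings; the 20-rating threshold is only a placeholder):

import statistics

def choose_R(ratings_by_item, min_ratings=20, percentile=50):
    # Only trust items that have enough ratings for a stable average.
    averages = sorted(
        statistics.mean(ratings)
        for ratings in ratings_by_item.values()
        if len(ratings) >= min_ratings
    )
    # percentile=50 gives the median; use 60 (optimistic) or 40 (pessimistic).
    index = round((percentile / 100) * (len(averages) - 1))
    return averages[index]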

Choosing W

The choice of W should depend on how many ratings a typical item has, and how consistent ratings are. W can be higher if items naturally obtain many ratings, and W should be higher if you have less confidence in user ratings (e.g., if you have high spammer activity). Note that W does not have to be an integer, and can be less than 1.

Choosing W is a more subjective matter than choosing R. However, here are some guidelines:

  • If a typical item obtains C ratings, then W should not exceed C, or else the final score will be more dependent on R than on the actual user ratings. Instead, W should be a small fraction of C, perhaps between C/20 and C/5 (depending on how noisy or "spammy" the ratings are).
  • If historical ratings are usually consistent (for an individual item), then W should be relatively small. On the other hand, if ratings for an item vary wildly, then W should be relatively large. You can think of this algorithm as "absorbing" W ratings that are abnormally high or low, turning those ratings into more moderate ones.
  • In the extreme, setting W = 0 is equivalent to using only the average of user ratings. Setting W = infinity is equivalent to proclaiming that every item has a true rating of R, regardless of the user ratings. Clearly, neither of these extremes is appropriate.
  • Setting W too large can have the effect of favoring an item with many moderately-high ratings over an item with slightly fewer exceptionally-high ratings, as the sketch below illustrates.
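
To make that last point concrete, here is a small continuation of the earlier sketch (same hypothetical adjusted_rating function, with R = 2):

many_moderate = [4] * 20   # many moderately-high ratings
few_excellent = [5] * 10   # slightly fewer exceptionally-high ratings

# With a modest W, the exceptionally-rated item ranks higher...
print(round(adjusted_rating(many_moderate, R=2, W=3), 2))   # 3.74
print(round(adjusted_rating(few_excellent, R=2, W=3), 2))   # 4.31

# ...but with a very large W, the many-ratings item overtakes it.
print(round(adjusted_rating(many_moderate, R=2, W=30), 2))  # 2.8
print(round(adjusted_rating(few_excellent, R=2, W=30), 2))  # 2.75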

OTHER TIPS

Since you've stated that the machine would only be given the rankings and the number of rankings, I would argue that attempting a calculated weighting method may be ill-advised.

First, there are too many unknowns to confirm the proposition that, in enough circumstances, a larger quantity of ratings is a better indication of quality than a smaller one. One example: how long have rankings been collected? Has each item ranked with this method received an equal collection duration (equal attention)? Other unknowns: which markets have had access to the item and, of course, who specifically ranked it?

Secondly, you've stated in a comment below the question that this is not for front-end use but rather that "the ratings are generated by machines, for machines," in response to my comment that "it's not necessarily only statistical. One person might consider 50 ratings enough, where that might not be enough for another. And some raters' profiles might look more reliable to one person than to another. When that's transparent, it lets the user make a more informed assessment."

Why would that be any different for machines? :)

In any case, if this is about machine-to-machine rankings, the question needs greater detail in order for us to understand how different machines might generate and use the rankings.

Can a ranking generated by a machine be flawed (so as to suggest that more rankings may somehow compensate for those "flawed" rankings)? What does that even mean - is it a machine error? Or is it because the item has no use to this particular machine, for example? There are many issues here we might first want to unpack. For instance, if we have access to how the machines generate the rankings, then on some level we may already know what the item means to a given machine, making the aggregated ranking superfluous.

What you can find on different platforms is the hiding of ratings for items without enough votes: "This item does not have enough ratings."
The problem is that you cannot express this in a simple ranking formula.

I would suggest hiding the ranking for items with fewer than a minimum number of votes, but internally calculating a moving average. I always prefer a moving average over a total average, since it favors recent votes over very old ones that might have been given under totally different circumstances.
Additionally, you do not need to keep a list of all votes: you only store the calculated average, and each new vote simply updates this value.

newAverage = weight * newVoting + (1-weight) * oldAverage

with a weight of about 0.05, which effectively gives preference to roughly the last 20 values (just experiment with this weight).

Additionally, I would start with these conditions:
no votes = mid-range value (1-5 stars => start with 3 stars)
the average is not shown if fewer than 10 votes have been given.
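
A rough sketch of that whole scheme in Python (the class name is just for illustration; the 0.05 weight and 10-vote threshold are the values suggested above):

class MovingAverageRating:
    def __init__(self, start=3.0, weight=0.05, min_votes=10):
        self.average = start        # no votes yet: mid-range value (3 of 5 stars)
        self.weight = weight        # how strongly each new vote pulls the average
        self.min_votes = min_votes  # hide the average below this many votes
        self.count = 0

    def add_vote(self, vote):
        # newAverage = weight * newVoting + (1 - weight) * oldAverage
        self.average = self.weight * vote + (1 - self.weight) * self.average
        self.count += 1

    def displayed_rating(self):
        # Returns None ("not enough votes yet") until the threshold is reached.
        return self.average if self.count >= self.min_votes else None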

A simple solution might be a plain average:

sum(votes) / number_of_votes

That way, 3 people voting 1 star and one person voting 5 stars would give an average of (1+1+1+5)/4 = 2 stars.

Simple, effective, and probably sufficient for your purposes.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow