Question

I have an MS SQL Server 2008 database where I store places that serve food (cafés, restaurants, diners etc.). On a web site connected to this database people can rate the places on a scale from 1 to 3.

On the web site there's a page where people can view a top list with the top 25 (best rated) places in a certain city. The database structure looks something like this (there is more info stored in the tables, but here's the relevant info): Database structure: Cities -> Places -> Votes

A place is situated in a city and votes are placed on a place.

Up until now I've just calculated an average vote score for each place by dividing the sum of all votes for that place by the number of votes for that place, something like this (pseudo code):

vote_count = total number of votes for the place
vote_sum = total sum of all the votes for the place

vote_score = vote_sum/vote_count

I also have to handle divide by zero if a place has no votes. All this is done inside the stored procedure that fetches the other data that I want to display in the top list. Here is the current stored procedure that fetches the top 25 places with the highest vote score:

ALTER PROCEDURE [dbo].[GetTopListByCity]
    (
    @city_id Int
    )
AS
    SELECT TOP 25 dbo.Places.place_id, 
           dbo.Places.city_id,
           dbo.Places.place_name,
           dbo.Places.place_alias,
           dbo.Places.place_street_address,
           dbo.Places.place_street_number,
           dbo.Places.place_zip_code,
           dbo.Cities.city_name,
           dbo.Cities.city_alias,
           dbo.Places.place_phone,
           dbo.Places.place_lat,
           dbo.Places.place_lng,
           ISNULL(SUM(dbo.Votes.vote_score),0) AS vote_sum,
           (SELECT COUNT(*) FROM dbo.Votes WHERE dbo.Votes.place_id = dbo.Places.place_id) AS vote_count,
           COALESCE((CONVERT(FLOAT,SUM(dbo.Votes.vote_score))/(CONVERT(FLOAT,(SELECT COUNT(*) FROM dbo.Votes WHERE dbo.Votes.place_id = dbo.Places.place_id)))),0) AS vote_score

    FROM dbo.Places INNER JOIN dbo.Cities ON dbo.Places.city_id = dbo.Cities.city_id
    LEFT OUTER JOIN dbo.Votes ON dbo.Places.place_id = dbo.Votes.place_id
    WHERE dbo.Places.city_id = @city_id
    AND dbo.Places.hidden = 0
    GROUP BY dbo.Places.place_id,
             dbo.Places.city_id,
             dbo.Places.place_name,
             dbo.Places.place_alias,
             dbo.Places.place_street_address,
             dbo.Places.place_street_number,
             dbo.Places.place_zip_code,
             dbo.Cities.city_name,
             dbo.Cities.city_alias,
             dbo.Places.place_phone,
             dbo.Places.place_lat,
             dbo.Places.place_lng
    ORDER BY vote_score DESC, vote_count DESC, place_name ASC

    RETURN

As you can see it fetches more than just the vote score - I need data about the place, the city it's situated in and so on. This works fine, but there is one big problem: the vote score is too simple because it doesn't take into account the number of votes. With the simple calculation method a place that has one vote with the score 3 will end up higher in the list than a place that has fourteen votes with the score 3 and one vote with the score 2:

3/1 = 3
(14*3 + 1*2)/15 = 44/15 = 2.933333333333

To fix this I've been looking into using some form of weighted average/weighted index. I've found an example of a true Bayesian estimate that looks promising. It looks like this:

weighted rating (WR) = (v ÷ (v+m)) × R + (m ÷ (v+m)) × C

where:

R = average for the place (mean) = (Rating)
v = number of votes for the place = (votes)
m = minimum number of votes required to be listed in the Top 25 (unsure how many, but somewhere between 2-5 seems realistic)
C = the mean vote across the whole database
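As a rough illustration of how the formula treats the two example places above, here is a minimal T-SQL sketch. The values m = 3 and C = 2.5 are assumptions picked only for the arithmetic, and the derived table p with its column names is hypothetical:

```sql
-- Assumed constants, chosen only for illustration.
DECLARE @m FLOAT = 3;    -- minimum number of votes (m)
DECLARE @C FLOAT = 2.5;  -- mean vote across the whole database (C)

SELECT
    p.label,
    p.v,
    p.R,
    -- WR = (v / (v+m)) * R + (m / (v+m)) * C
    (p.v / (p.v + @m)) * p.R + (@m / (p.v + @m)) * @C AS WR
FROM (VALUES
        ('one vote of 3',         CAST(1  AS FLOAT), CAST(3  AS FLOAT)),
        ('fourteen 3s and one 2', CAST(15 AS FLOAT), CAST(44 AS FLOAT) / 15)
     ) AS p (label, v, R)
ORDER BY WR DESC;
```

Under these assumptions the single-vote place gets 0.25 × 3 + 0.75 × 2.5 = 2.625, while the fifteen-vote place gets about 2.861, so the order of the two places is reversed compared with the plain average.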

The problems begin when I try to implement this weighted rating in a stored procedure - it quickly becomes complicated and I get tangled up in parentheses and lose track of what the stored procedure does.

Now I need some help with two questions:

Is this a suitable method for calculating a weighted index for my site?

What would this (or another suitable calculation method) look like when implemented in a stored procedure?

Solution

I cannot see any problem with your calculations, but I can see that you are doing the same thing many times. My suggestion does the aggregates in one place, which makes the final select quite easy.

;WITH CTE AS
(
    SELECT
        SUM(dbo.Votes.vote_score) AS SumOfVoteScore,
        COUNT(*) AS CountOfVotes,
        Votes.place_id
    FROM
        Votes
    GROUP BY
        Votes.place_id
)
 SELECT TOP 25 
    dbo.Places.place_id, 
    dbo.Places.city_id,
    dbo.Places.place_name,
    dbo.Places.place_alias,
    dbo.Places.place_street_address,
    dbo.Places.place_street_number,
    dbo.Places.place_zip_code,
    dbo.Cities.city_name,
    dbo.Cities.city_alias,
    dbo.Places.place_phone,
    dbo.Places.place_lat,
    dbo.Places.place_lng,
    ISNULL(CTE.SumOfVoteScore,0) AS vote_sum,
    ISNULL(CTE.CountOfVotes,0) AS vote_count,
    COALESCE((CONVERT(FLOAT,CTE.SumOfVoteScore)/
    (CONVERT(FLOAT,CTE.CountOfVotes))),0) AS vote_score

FROM dbo.Places INNER JOIN dbo.Cities ON dbo.Places.city_id = dbo.Cities.city_id
LEFT JOIN CTE ON dbo.Places.place_id=CTE.place_id
WHERE dbo.Places.city_id = @city_id
AND dbo.Places.hidden = 0
GROUP BY dbo.Places.place_id,
         dbo.Places.city_id,
         dbo.Places.place_name,
         dbo.Places.place_alias,
         dbo.Places.place_street_address,
         dbo.Places.place_street_number,
         dbo.Places.place_zip_code,
         dbo.Cities.city_name,
         dbo.Cities.city_alias,
         dbo.Places.place_phone,
         dbo.Places.place_lat,
         dbo.Places.place_lng,
         CTE.SumOfVoteScore,
         CTE.CountOfVotes
ORDER BY vote_score DESC, vote_count DESC, place_name ASC

The CTE lets us reuse the calculations, so we don't have to repeat SUM(vote_score) and SELECT COUNT(*) FROM Votes WHERE... multiple times. The calculations in the final select are then quite easy to follow.

I hope this helps

Edit

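You do not have to define the column list in the CTE. WITH CTE (SumOfVoteScore, CountOfVotes, place_id) AS works just as well as WITH CTE AS, because every column in the inner select already has a name. The explicit list is needed when the query does not supply a distinct name for every column, and it is often written out in recursive CTEs, where the anchor part is combined with the recursive part through a union.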
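To make the two spellings concrete, here is a small self-contained sketch. The table variable @Votes and its three rows are hypothetical sample data standing in for the real Votes table, and the CTE name VoteTotals is made up for the example:

```sql
-- Hypothetical sample data standing in for the Votes table.
DECLARE @Votes TABLE (place_id INT, vote_score INT);
INSERT INTO @Votes (place_id, vote_score)
VALUES (1, 3), (1, 2), (2, 3);

-- Form 1: column names come from the aliases inside the query.
WITH VoteTotals AS
(
    SELECT place_id,
           SUM(vote_score) AS SumOfVoteScore,
           COUNT(*)        AS CountOfVotes
    FROM @Votes
    GROUP BY place_id
)
SELECT place_id, SumOfVoteScore, CountOfVotes
FROM VoteTotals;

-- Form 2: column names come from the list after the CTE name.
WITH VoteTotals (place_id, SumOfVoteScore, CountOfVotes) AS
(
    SELECT place_id, SUM(vote_score), COUNT(*)
    FROM @Votes
    GROUP BY place_id
)
SELECT place_id, SumOfVoteScore, CountOfVotes
FROM VoteTotals;
```

Both statements should return the same rows. In the second form the aggregates need no aliases, because the column list supplies the names.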

For reference here and here you will find some information about CTE functions

Other tips

Thanks Arion!

I had been looking for something along the lines of a CTE, but I just didn't know that was what I was looking for! It's always nice to learn something new and I know I will make use of CTEs in other projects. When I implement your CTE in my stored procedure, I get this code:

ALTER PROCEDURE dbo.GetTopListByCityCTE
    (
    @city_id Int
    )
AS

;WITH CTE (SumOfVoteScore, CountOfVotes, place_id) AS
(
    SELECT
        SUM(dbo.Votes.vote_score) AS SumOfVoteScore,
        COUNT(*) AS CountOfVotes,
        Votes.place_id
    FROM
        Votes
    GROUP BY
        Votes.place_id

)

 SELECT TOP 25 
    dbo.Places.place_id, 
    dbo.Places.city_id,
    dbo.Places.place_name,
    dbo.Places.place_alias,
    dbo.Places.place_street_address,
    dbo.Places.place_street_number,
    dbo.Places.place_zip_code,
    dbo.Cities.city_name,
    dbo.Cities.city_alias,
    dbo.Places.place_phone,
    dbo.Places.place_lat,
    dbo.Places.place_lng,
    ISNULL(CTE.SumOfVoteScore,0) AS vote_sum,
    CTE.CountOfVotes AS vote_count,
    COALESCE((CONVERT(FLOAT,CTE.SumOfVoteScore)/
    (CONVERT(FLOAT,CTE.CountOfVotes))),0) AS vote_score

FROM dbo.Places INNER JOIN dbo.Cities ON dbo.Places.city_id = dbo.Cities.city_id
LEFT JOIN CTE ON dbo.Places.place_id = CTE.place_id
WHERE dbo.Places.city_id = @city_id
AND dbo.Places.hidden = 0
GROUP BY dbo.Places.place_id,
         dbo.Places.city_id,
         dbo.Places.place_name,
         dbo.Places.place_alias,
         dbo.Places.place_street_address,
         dbo.Places.place_street_number,
         dbo.Places.place_zip_code,
         dbo.Cities.city_name,
         dbo.Cities.city_alias,
         dbo.Places.place_phone,
         dbo.Places.place_lat,
         dbo.Places.place_lng,
         CTE.SumOfVoteScore,
         CTE.CountOfVotes
ORDER BY vote_score DESC, vote_count DESC, place_name ASC

A quick check reveals that it returns the same result as the previous code, but it's much easier to read and follow and hopefully much more efficient.

Now I will have to do some experimenting with replacing the old (simple) rating calculation method with a new one that takes into account the number of votes.

Okay - so here's the stored procedure I came up with:

ALTER PROCEDURE dbo.GetTopListByCityCTE
    (
    @city_id Int
    )
AS

DECLARE @MinimumNumber float;
DECLARE @TotalNumberOfVotes int;
DECLARE @AverageRating float;
DECLARE @AverageNumberOfVotes float;

/* MINIMUM NUMBER */
SET @MinimumNumber = 1;

/* TOTAL NUMBER OF VOTES -- ALL PLACES */
SET @TotalNumberOfVotes = (
    SELECT COUNT(*) FROM Votes
);

/* AVERAGE RATING -- ALL PLACES */
SET @AverageRating = (
    SELECT
        CONVERT(FLOAT,(SUM(dbo.Votes.vote_score))) / CONVERT(FLOAT,COUNT(*)) AS AverageRating
    FROM 
        Votes);

/* AVERAGE NUMBER OF VOTES -- ALL PLACES */
/* CURRENTLY NOT USED IN INDEX - KEPT FOR REFERENCE */
SET @AverageNumberOfVotes = (
    SELECT AVG(CONVERT(FLOAT,NumberOfVotes)) FROM (SELECT COUNT(*) AS NumberOfVotes FROM Votes GROUP BY place_id) AS AverageNumberOfVotes

);
/* SUM OF ALL VOTE SCORES AND COUNT OF ALL VOTES -- INDIVIDUAL PLACES */
WITH CTE AS (
    SELECT
        CONVERT(FLOAT, SUM(dbo.Votes.vote_score)) AS SumVotesForPlace,
        CONVERT(FLOAT, COUNT(*)) AS CountVotesForPlace,
        Votes.place_id
    FROM
        Votes
    GROUP BY
        Votes.place_id
)

 SELECT 
    dbo.Places.place_id, 
    dbo.Places.city_id,
    dbo.Places.place_name,
    dbo.Places.place_alias,
    dbo.Places.place_street_address,
    dbo.Places.place_street_number,
    dbo.Places.place_zip_code,
    dbo.Cities.city_name,
    dbo.Cities.city_alias,
    dbo.Places.place_phone,
    dbo.Places.place_lat,
    dbo.Places.place_lng,
    ISNULL(CTE.SumVotesForPlace,0) AS vote_sum,
    ISNULL(CTE.CountVotesForPlace,0) AS vote_count,
    COALESCE((CTE.SumVotesForPlace/
    CTE.CountVotesForPlace),0) AS vote_score,
    ISNULL(
        (CTE.CountVotesForPlace / (CTE.CountVotesForPlace + @MinimumNumber))
            * COALESCE(CTE.SumVotesForPlace / CTE.CountVotesForPlace, 0)
        + (@MinimumNumber / (CTE.CountVotesForPlace + @MinimumNumber))
            * @AverageRating,
        0) AS WeightedIndex

FROM dbo.Places INNER JOIN dbo.Cities ON dbo.Places.city_id = dbo.Cities.city_id
LEFT JOIN CTE ON dbo.Places.place_id = CTE.place_id
WHERE dbo.Places.city_id = @city_id
AND dbo.Places.hidden = 0
GROUP BY dbo.Places.place_id,
         dbo.Places.city_id,
         dbo.Places.place_name,
         dbo.Places.place_alias,
         dbo.Places.place_street_address,
         dbo.Places.place_street_number,
         dbo.Places.place_zip_code,
         dbo.Cities.city_name,
         dbo.Cities.city_alias,
         dbo.Places.place_phone,
         dbo.Places.place_lat,
         dbo.Places.place_lng,
         CTE.SumVotesForPlace,
         CTE.CountVotesForPlace
ORDER BY WeightedIndex DESC, vote_count DESC, place_name ASC

There's a variable called @AverageNumberOfVotes which is not used in the calculation, but I kept it there for reference in case it could be needed.

Running this against the data I have, I get results that are slightly different from what I got before, but it's no revolution and it's not quite what I needed. Here are the top 10 rows that are returned when I execute the SP above:

vote_sum        vote_count  vote_score          WeightedIndex
1110            409         2,71393643031785    2,7140960047496
807             310         2,60322580645161    2,60449697749787
38              15          2,53333333333333    2,56708633093525
25              10          2,5                 2,55442722744881
2               1           2                   2,55188848920863
2               1           2                   2,55188848920863
2               1           2                   2,55188848920863
2               1           2                   2,55188848920863
2               1           2                   2,55188848920863
2               1           2                   2,55188848920863

The problem here seems to be that where there's only one vote and the score is 2, the weighted index becomes 2,55188848920863?

The formula for calculating this index is taken from IMDB (http://www.imdb.com/chart/top) and I'm thinking that either I've done something wrong or the data that I have in my database is not comparable to the data (number of votes or voting scale) that IMDB has?
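One observation that may explain the numbers: the formula shrinks every place toward the overall mean C, so a place whose only vote is a 2 gets pulled up toward C rather than staying at 2. Back-calculating from the rows above, the listed WeightedIndex values appear to match m = 3 and a C of roughly 2.7359, not the @MinimumNumber = 1 shown in the procedure. Both figures are inferred from the printed output and are not stated anywhere in the post. A minimal sketch that redoes the arithmetic for three of the rows:

```sql
-- Both constants are back-calculated from the output table, i.e. assumptions.
DECLARE @m FLOAT = 3;
DECLARE @C FLOAT = 2.7358513;

SELECT
    r.vote_sum,
    r.vote_count,
    r.vote_sum / r.vote_count AS vote_score,
    -- Same formula as (v/(v+m))*R + (m/(v+m))*C, using v*R = vote_sum.
    (r.vote_sum + @m * @C) / (r.vote_count + @m) AS WeightedIndex
FROM (VALUES
        (CAST(1110 AS FLOAT), CAST(409 AS FLOAT)),
        (CAST(25   AS FLOAT), CAST(10  AS FLOAT)),
        (CAST(2    AS FLOAT), CAST(1   AS FLOAT))
     ) AS r (vote_sum, vote_count)
ORDER BY WeightedIndex DESC;
```

For the single vote of 2 this gives (2 + 3 × 2.7358513) / 4, which is about 2.5519. So the value looks like the expected behaviour of the formula with a mean this high, not a calculation error.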

Edit

Is there a way that I could adjust this function so it works better for me? Is there a different function/approach that would work better? I still need to do the calculations in the stored procedure.
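One possible adjustment follows from the definition of m as the minimum number of votes required to be listed: leave places with fewer than m votes out of the top list and apply the weighting only to the rest. A minimal sketch, in which the table variable @PlaceTotals is a hypothetical stand-in for the per-place CTE and the values for m and C are assumptions:

```sql
-- Hypothetical per-place totals standing in for the CTE in the procedure.
DECLARE @PlaceTotals TABLE (place_id INT, vote_sum FLOAT, vote_count FLOAT);
INSERT INTO @PlaceTotals (place_id, vote_sum, vote_count)
VALUES (1, 1110, 409), (2, 807, 310), (3, 38, 15), (4, 25, 10), (5, 2, 1);

DECLARE @MinimumNumber FLOAT = 3;          -- assumed threshold (m)
DECLARE @AverageRating FLOAT = 2.7358513;  -- assumed overall mean (C)

SELECT TOP 25
    t.place_id,
    t.vote_count,
    t.vote_sum / t.vote_count AS vote_score,
    (t.vote_count / (t.vote_count + @MinimumNumber)) * (t.vote_sum / t.vote_count)
      + (@MinimumNumber / (t.vote_count + @MinimumNumber)) * @AverageRating AS WeightedIndex
FROM @PlaceTotals AS t
WHERE t.vote_count >= @MinimumNumber   -- places below the threshold are not listed
ORDER BY WeightedIndex DESC, t.vote_count DESC;
```

In the stored procedure the equivalent change would be an extra condition on CTE.CountVotesForPlace in the WHERE clause. Whether hiding places with very few votes is acceptable for the site is a product decision, so treat this as one option and not as the definitive fix.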

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow