Question

Hope you're doing fine.

I need a bit of help with this database:

[Image: database schema diagram]

This is a database that stores votes. Users pick the audio tracks they like and vote for them. They can vote 'up' or 'down'. Easy as pie. But when it comes to calculating stats, it gets hairy.

Meta

It's a key-value style table that stores the most commonly used stats (a sort of cache):

mysql> SELECT * FROM Meta;
+-------------+-------+
| Key         | Value |
+-------------+-------+
| TRACK_COUNT | 2620  |
| VOTE_COUNT  | 3821  |
| USER_COUNT  | 371   |
+-------------+-------+

Vote

The Vote table holds the vote itself. The only interesting field here is Type, whose values mean:

  1. 0 - App-made vote: the user voted for the track using the UI
  2. 1 - Imported vote (from an external service)
  3. 2 - Merged vote. The same as an imported vote, but it also records that the user had already voted for this track through the external service and is now voting again in the app.

Track

The Track table holds the total stats for each track: the number of likes, dislikes, likes from the external service (LikesRP), dislikes from the external service (DislikesRP), and likes/dislikes adjustments.

App

The app needs to fetch the votes for:

  1. 5 most up-voted tracks during the last 7 days
  2. 5 most down-voted tracks during the last 7 days
  3. 5 most up-voted tracks during the last 7 days, votes of which were imported from the external service (Vote.Type = 1)
  4. 100 most up-voted tracks during the last month

To get the 100 most up-voted tracks I use this query:

SELECT
    T.Hash,
    T.Title,
    T.Artist,
    COALESCE(X.VotesTotal, 0) + T.LikesAdjust as VotesAdjusted
FROM (
    SELECT
        V.TrackHash,
        SUM(V.Vote) AS VotesTotal
    FROM
        Vote V
    WHERE
        V.CreatedAt > NOW() - INTERVAL 1 MONTH AND V.Vote = 'up'
    GROUP BY
        V.TrackHash
    ORDER BY
        VotesTotal DESC
) X
RIGHT JOIN Track T
    ON T.Hash = X.TrackHash
ORDER BY
    VotesAdjusted DESC
LIMIT 0, 100;

This query works OK and it honors the adjustments (the client wanted to adjust track positions in the lists). Almost the same query is used to get the 5 most up/down-voted tracks. The query for task #3 is this:

SELECT
    T.Hash,
    T.Title,
    T.Artist,
    COALESCE(X.VotesTotal, 1) as VotesTotal
FROM (
    SELECT
        V.TrackHash,
        SUM(V.Vote) AS VotesTotal
    FROM
        Vote V
    WHERE
        V.Type = '1' AND
        V.CreatedAt > NOW() - INTERVAL 1 WEEK AND
        V.Vote = 'up'
    GROUP BY
        V.TrackHash
    ORDER BY
        VotesTotal DESC
) X
RIGHT JOIN Track T
    ON T.Hash = X.TrackHash
ORDER BY
    VotesTotal DESC
LIMIT 0, 5;

The problem is that the first query takes about 2 seconds to run, and we have fewer than 4k votes. By the end of the year this figure will be about 200k votes, which will most likely kill this database. So I'm trying to figure out how to solve this puzzle.

It comes down to these questions:

  1. Did I make the database design wrong? I mean, could it be better?
  2. Did I make the query wrong?
  3. Anything else I could improve?

The first thing I did was caching, which solves the problem drastically. But I'm curious about an SQL-related solution (always leaning towards perfection).

The second idea I had was to put those calculated values into the Meta table and update them during the voting procedure. But I'm too short on time to try it out. Would it be worth it, by the way? Or how do enterprise-class apps solve these problems?

Thanks.

EDIT

I can't believe I forgot to include indices. Here they are:

mysql> SHOW INDEXES IN Vote;
+-------+------------+-------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table | Non_unique | Key_name                | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+-------+------------+-------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Vote  |          0 | UNIQUE_UserId_TrackHash |            1 | UserId      | A         |         890 |     NULL | NULL   |      | BTREE      |         |
| Vote  |          0 | UNIQUE_UserId_TrackHash |            2 | TrackHash   | A         |        4450 |     NULL | NULL   |      | BTREE      |         |
| Vote  |          1 | INDEX_TrackHash         |            1 | TrackHash   | A         |        4450 |     NULL | NULL   |      | BTREE      |         |
| Vote  |          1 | INDEX_CreatedAt         |            1 | CreatedAt   | A         |        1483 |     NULL | NULL   |      | BTREE      |         |
| Vote  |          1 | UserId                  |            1 | UserId      | A         |        1483 |     NULL | NULL   |      | BTREE      |         |
+-------+------------+-------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+

mysql> SHOW INDEXES IN Track;
+-------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table | Non_unique | Key_name       | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+-------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Track |          0 | PRIMARY        |            1 | Hash        | A         |        2678 |     NULL | NULL   |      | BTREE      |         |
| Track |          1 | INDEX_Likes    |            1 | Likes       | A         |          66 |     NULL | NULL   |      | BTREE      |         |
| Track |          1 | INDEX_Dislikes |            1 | Dislikes    | A         |          27 |     NULL | NULL   |      | BTREE      |         |
+-------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+

Solution

This is a very subjective question: the answer depends heavily on your exact requirements and on performance testing, which nobody here can do against your data. But I can answer your questions and add some generic solutions that might work for you:


Did I make the database design wrong? I mean, could it be better?

No. This is the ideal design for OLTP.


Did I make the query wrong?

No (although the ORDER BY in each subquery is redundant). The performance of your query depends very much on the indexes on the Vote table, since the main columns queried are in this part:

SELECT  V.TrackHash, SUM(V.Vote) AS VotesTotal
FROM    Vote V
WHERE   V.CreatedAt > NOW() - INTERVAL 1 MONTH AND V.Vote = 'up'
GROUP BY V.TrackHash

I would suggest 2 indexes: one on TrackHash and a composite one on CreatedAt, Vote and Type (this may perform better as 3 separate indexes; it is worth testing both ways). 200k rows is not that much data, so with the right indexes it shouldn't take too long to query data over the last month.
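As a rough sketch, the composite index could be created and checked like this. The index name is made up for illustration, and the column order simply follows the order named above; whether a different order (or separate indexes) works better on your data is something only EXPLAIN and timing can tell you.

```sql
-- Composite index on the columns the WHERE clauses filter on.
-- The index name is hypothetical; the column order is an assumption to test.
CREATE INDEX INDEX_CreatedAt_Vote_Type
    ON Vote (CreatedAt, Vote, Type);

-- Check whether the optimizer actually uses it for the inner query.
EXPLAIN
SELECT  V.TrackHash, SUM(V.Vote) AS VotesTotal
FROM    Vote V
WHERE   V.CreatedAt > NOW() - INTERVAL 1 MONTH AND V.Vote = 'up'
GROUP BY V.TrackHash;
```

If the `key` column of the EXPLAIN output does not show the new index, that is a hint to try the alternative layouts before keeping it.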


Anything else I could improve?

This is very much a balancing act; the best way to proceed really depends on your exact requirements. There are 3 main ways you could approach the problem, plus a combination of them.

1. Your current approach (query vote table each time)

As mentioned before, I think this approach should be scalable for your application. The advantage is that it does not require any maintenance, and all data sent to the application is up to date and accurate. The disadvantage is performance: it might take a bit longer to insert data (due to updating indexes) and also to select data. This would be my preferred approach.

2. OLAP approach

This would involve maintaining a summary table such as:

CREATE TABLE VoteArchive
(       TrackHash           CHAR(40) NOT NULL,
        CreatedDate         DATE NOT NULL,
        AppMadeUpVotes      INT NOT NULL,
        AppMadeDownVotes    INT NOT NULL,
        ImportedUpVotes     INT NOT NULL,
        ImportedDownVotes   INT NOT NULL,
        MergedUpVotes       INT NOT NULL,
        MergedDownVotes     INT NOT NULL,
    PRIMARY KEY (CreatedDate, TrackHash)
);

This can be populated nightly by running a simple query

-- Assumes the job runs once, at the end of each day, so that
-- today's votes are summarised exactly once.
INSERT INTO VoteArchive
SELECT  TrackHash,
        DATE(CreatedAt),
        COUNT(CASE WHEN Vote = 'up' AND Type = 0 THEN 1 END),   -- app-made up
        COUNT(CASE WHEN Vote = 'down' AND Type = 0 THEN 1 END), -- app-made down
        COUNT(CASE WHEN Vote = 'up' AND Type = 1 THEN 1 END),   -- imported up
        COUNT(CASE WHEN Vote = 'down' AND Type = 1 THEN 1 END), -- imported down
        COUNT(CASE WHEN Vote = 'up' AND Type = 2 THEN 1 END),   -- merged up
        COUNT(CASE WHEN Vote = 'down' AND Type = 2 THEN 1 END)  -- merged down
FROM    Vote
WHERE   CreatedAt >= DATE(CURRENT_TIMESTAMP)
GROUP BY TrackHash, DATE(CreatedAt);

You can then use this table in place of your live data. It has the advantage that the date is the leading column of the clustered index, so any query limited by date should be very fast. The disadvantage is that queries against this table are only accurate up to the last time it was populated, although they will be much faster. It is also additional work to maintain. However, this would be my second choice if I could not query live data.
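For illustration, the "100 most up-voted tracks during the last month" list could then be read from the summary table roughly like this. It is only a sketch: it assumes up-votes of all three types should be counted together (as the original query does by not filtering on Type), and it reuses the Track columns shown in the question.

```sql
SELECT  T.Hash,
        T.Title,
        T.Artist,
        COALESCE(X.UpVotes, 0) + T.LikesAdjust AS VotesAdjusted
FROM    Track T
LEFT JOIN (
        -- One row per track: up-votes summed over the daily summary rows.
        SELECT  A.TrackHash,
                SUM(A.AppMadeUpVotes + A.ImportedUpVotes + A.MergedUpVotes) AS UpVotes
        FROM    VoteArchive A
        WHERE   A.CreatedDate > CURRENT_DATE - INTERVAL 1 MONTH
        GROUP BY A.TrackHash
    ) X ON X.TrackHash = T.Hash
ORDER BY VotesAdjusted DESC
LIMIT 100;
```

The inner query touches at most one row per track per day, so the amount of data scanned grows with the number of active tracks rather than with the number of votes.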

3. Update statistics during voting

I am including this for completeness but would implore you not to use this method. You could achieve this in either your application layer or via a trigger. Although it allows querying up-to-date data without having to query the "production" table, it is open to errors, and I have never come across anyone who truly advocates this approach. For every vote you need to run insert/update logic, which turns a very fast insert into a longer process, and depending on how you do the maintenance there is a chance (albeit a very small one) of concurrency issues.

4. A combination of the above

You could always have 2 tables in the same format as your vote table, plus one table as set out in solution 2: one vote table just for storing today's votes, one for historic votes, and the summary table. You can then combine today's data with the summary table to get up-to-date results without querying a lot of data. Again, this is additional maintenance, and more potential for things to go wrong.
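A minimal sketch of such a combined read, for the "5 most up-voted tracks during the last 7 days" list, might look like this. VoteToday is a hypothetical name for the table that holds only today's votes, and the sketch assumes VoteArchive holds complete days up to yesterday.

```sql
SELECT  U.TrackHash,
        SUM(U.UpVotes) AS TotalUpVotes
FROM (
        -- Historic part: complete days from the summary table.
        SELECT  A.TrackHash,
                A.AppMadeUpVotes + A.ImportedUpVotes + A.MergedUpVotes AS UpVotes
        FROM    VoteArchive A
        WHERE   A.CreatedDate > CURRENT_DATE - INTERVAL 7 DAY
          AND   A.CreatedDate < CURRENT_DATE
        UNION ALL
        -- Live part: today's votes, counted from the small table.
        SELECT  V.TrackHash,
                COUNT(*) AS UpVotes
        FROM    VoteToday V        -- hypothetical table name
        WHERE   V.Vote = 'up'
        GROUP BY V.TrackHash
    ) U
GROUP BY U.TrackHash
ORDER BY TotalUpVotes DESC
LIMIT 5;
```

UNION ALL is used rather than UNION because the two parts cover disjoint days, and duplicate rows (two days with the same count for a track) must not be collapsed.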

Licensed under: CC-BY-SA with attribution