Question

I have this table:

CREATE TABLE `votes` (
  `item_id` int(10) unsigned NOT NULL,
  `user_id` int(10) unsigned NOT NULL,
  `vote` tinyint(4) NOT NULL DEFAULT '0',
  PRIMARY KEY (`item_id`,`user_id`),
  KEY `FK_vote_user` (`user_id`),
  KEY `vote` (`vote`),
  KEY `item` (`item_id`),
  CONSTRAINT `FK_vote_item` FOREIGN KEY (`item_id`) REFERENCES `items` (`id`) ON UPDATE CASCADE,
  CONSTRAINT `FK_vote_user` FOREIGN KEY (`user_id`) REFERENCES `users` (`id`) ON UPDATE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci

And I have this simple SELECT:

SELECT 
  `a`.`item_id`, `a`.`sum`
FROM
  (SELECT 
    `item_id`, SUM(vote) AS `sum` 
  FROM
    `votes` 
  GROUP BY `item_id`) AS a 
ORDER BY `a`.`sum` DESC
LIMIT 10

Right now, with only 250 rows, there isn't a problem, but the query is using filesort. The vote column only ever holds -1, 0 or 1. Will this still perform well when the table has millions of rows?

If I simplify the query by removing the subquery, then "Using temporary" appears in the EXPLAIN output.

Explain gives (the query completes in 0.00170s):

id select_type table      type  possible_keys key     key_len ref  rows Extra
1  PRIMARY     <derived2> ALL   NULL          NULL    NULL    NULL 33   Using filesort
2  DERIVED     votes      index NULL          PRIMARY 8       NULL 250

Solution

No, this won't be efficient with millions of rows.

You'll have to create a supporting aggregate table which would store votes per item:

CREATE TABLE item_votes
        (
        item_id INT UNSIGNED NOT NULL PRIMARY KEY,  -- same type as votes.item_id
        votes INT NOT NULL,                         -- net score (upvotes - downvotes), can be negative
        upvotes INT UNSIGNED NOT NULL,
        downvotes INT UNSIGNED NOT NULL,
        KEY (votes),
        KEY (upvotes),
        KEY (downvotes)
        )

and update it each time a vote is cast:

INSERT
INTO    item_votes (item_id, votes, upvotes, downvotes)
VALUES  (
        $item_id,
        CASE WHEN $upvote THEN 1 ELSE -1 END,
        CASE WHEN $upvote THEN 1 ELSE 0 END,
        CASE WHEN $upvote THEN 0 ELSE 1 END
        )
ON DUPLICATE KEY UPDATE
        votes = votes + VALUES(upvotes) - VALUES(downvotes),
        upvotes = upvotes + VALUES(upvotes),
        downvotes = downvotes + VALUES(downvotes)

then select top 10 votes:

SELECT  *
FROM    item_votes
ORDER BY
        votes DESC, item_id DESC
LIMIT   10

efficiently using an index.
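
The answer does not say how to seed the summary table from votes that already exist. The sketch below is one way the whole flow could look, run against an in-memory SQLite database through Python's sqlite3 module purely so that it is self-contained. Assumptions not in the original: the sample rows, the `cast_vote` helper, the `item_votes_votes` index name, and SQLite's `ON CONFLICT ... DO UPDATE` (SQLite 3.24 or later) standing in for MySQL's `ON DUPLICATE KEY UPDATE`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE votes (
        item_id INTEGER NOT NULL,
        user_id INTEGER NOT NULL,
        vote    INTEGER NOT NULL DEFAULT 0,
        PRIMARY KEY (item_id, user_id)
    );
    CREATE TABLE item_votes (
        item_id   INTEGER NOT NULL PRIMARY KEY,
        votes     INTEGER NOT NULL,   -- net score, can be negative
        upvotes   INTEGER NOT NULL,
        downvotes INTEGER NOT NULL
    );
    CREATE INDEX item_votes_votes ON item_votes (votes);
""")

# Votes that existed before the summary table was introduced (made-up data).
conn.executemany(
    "INSERT INTO votes (item_id, user_id, vote) VALUES (?, ?, ?)",
    [(1, 10, 1), (1, 11, 1), (1, 12, -1), (2, 10, -1), (3, 10, 1), (3, 11, 1)],
)

# One-time backfill: aggregate the existing rows into the summary table.
conn.execute("""
    INSERT INTO item_votes (item_id, votes, upvotes, downvotes)
    SELECT item_id,
           SUM(vote),
           SUM(CASE WHEN vote = 1 THEN 1 ELSE 0 END),
           SUM(CASE WHEN vote = -1 THEN 1 ELSE 0 END)
    FROM votes
    GROUP BY item_id
""")


def cast_vote(item_id, user_id, upvote):
    """Store a new vote and bump the counters in one transaction."""
    up, down = (1, 0) if upvote else (0, 1)
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO votes (item_id, user_id, vote) VALUES (?, ?, ?)",
            (item_id, user_id, up - down),
        )
        conn.execute(
            """
            INSERT INTO item_votes (item_id, votes, upvotes, downvotes)
            VALUES (?, ?, ?, ?)
            ON CONFLICT(item_id) DO UPDATE SET
                votes     = votes + excluded.votes,
                upvotes   = upvotes + excluded.upvotes,
                downvotes = downvotes + excluded.downvotes
            """,
            (item_id, up - down, up, down),
        )


cast_vote(2, 11, upvote=False)  # second downvote on an existing item
cast_vote(4, 10, upvote=True)   # first vote on a new item

top = conn.execute(
    "SELECT item_id, votes, upvotes, downvotes FROM item_votes "
    "ORDER BY votes DESC, item_id DESC LIMIT 10"
).fetchall()
print(top)
```

Like the statement in the answer, this only covers newly cast votes. A user changing or retracting a vote would need a matching adjustment of the counters.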

Other suggestions

But will this be performant when this table has millions of rows?

No, it won't.

If I simplify the query by removing the subquery, then "Using temporary" appears in the EXPLAIN output.

Probably because the planner would turn it into the query you posted: it needs to calculate the sum to return the results in the correct order.

To quickly grab the top-voted items, you need to cache the result. Add a score field to your items table, maintain it (e.g. using triggers), and index it. You'll then be able to grab the top 10 scores using an index scan.
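
As a rough illustration of the trigger idea, here is a minimal sketch that runs against in-memory SQLite via Python's sqlite3 module so that it is self-contained. The trigger names, the `items_score` index name, and the sample data are made up for the example. The trigger bodies should carry over to MySQL largely unchanged, although the mysql client typically needs a `DELIMITER` change for multi-statement bodies, and that port is an assumption rather than something tested here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (
        id    INTEGER PRIMARY KEY,
        score INTEGER NOT NULL DEFAULT 0   -- cached SUM(vote) for the item
    );
    CREATE INDEX items_score ON items (score);

    CREATE TABLE votes (
        item_id INTEGER NOT NULL REFERENCES items (id),
        user_id INTEGER NOT NULL,
        vote    INTEGER NOT NULL DEFAULT 0,
        PRIMARY KEY (item_id, user_id)
    );

    -- Keep items.score in step with every change to votes.
    CREATE TRIGGER votes_ai AFTER INSERT ON votes FOR EACH ROW
    BEGIN
        UPDATE items SET score = score + NEW.vote WHERE id = NEW.item_id;
    END;

    CREATE TRIGGER votes_au AFTER UPDATE ON votes FOR EACH ROW
    BEGIN
        UPDATE items SET score = score - OLD.vote WHERE id = OLD.item_id;
        UPDATE items SET score = score + NEW.vote WHERE id = NEW.item_id;
    END;

    CREATE TRIGGER votes_ad AFTER DELETE ON votes FOR EACH ROW
    BEGIN
        UPDATE items SET score = score - OLD.vote WHERE id = OLD.item_id;
    END;
""")

conn.executemany("INSERT INTO items (id) VALUES (?)", [(1,), (2,), (3,)])
conn.executemany(
    "INSERT INTO votes (item_id, user_id, vote) VALUES (?, ?, ?)",
    [(1, 10, 1), (1, 11, 1), (2, 10, 1), (2, 11, -1), (3, 10, -1)],
)
# A user changes a downvote into an upvote, another retracts a vote.
conn.execute("UPDATE votes SET vote = 1 WHERE item_id = 2 AND user_id = 11")
conn.execute("DELETE FROM votes WHERE item_id = 1 AND user_id = 11")

# The top-10 query now reads only the cached, indexed column.
top = conn.execute(
    "SELECT id, score FROM items ORDER BY score DESC, id DESC LIMIT 10"
).fetchall()
print(top)
```

Covering UPDATE and DELETE as well as INSERT keeps the cached score correct when users change or withdraw votes, at the cost of one extra write to items per vote.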

First, you don't need the subquery, so you can rewrite your query as:

SELECT `item_id`, SUM(vote) AS `sum` 
FROM `votes`
GROUP BY `item_id`
ORDER BY `sum` DESC
LIMIT 10

Second, you can build an index on votes(item_id, vote). The GROUP BY can then be resolved by scanning that index alone. The scan will take longer as the table gets bigger, but it should be manageable for reasonable data sizes.
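
A small sketch of that composite index, again using in-memory SQLite through Python's sqlite3 module only so that it runs as-is. The index name `votes_item_vote`, the `vote_sum` alias, and the sample rows are assumptions. The printed plan is SQLite's, not MySQL's, so on MySQL you would still run EXPLAIN on the real table and look for "Using index" in the Extra column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE votes (
        item_id INTEGER NOT NULL,
        user_id INTEGER NOT NULL,
        vote    INTEGER NOT NULL DEFAULT 0,
        PRIMARY KEY (item_id, user_id)
    );
    -- Grouping column first, aggregated column second, so the
    -- aggregation can be answered from the index without row lookups.
    CREATE INDEX votes_item_vote ON votes (item_id, vote);
""")
conn.executemany(
    "INSERT INTO votes (item_id, user_id, vote) VALUES (?, ?, ?)",
    [
        (1, 10, 1), (1, 11, 1), (1, 12, 1),
        (2, 10, 1), (2, 11, -1),
        (3, 10, -1),
        (4, 10, 1), (4, 11, 1),
    ],
)

query = """
    SELECT item_id, SUM(vote) AS vote_sum
    FROM votes
    GROUP BY item_id
    ORDER BY vote_sum DESC
    LIMIT 10
"""

# Show which access path the engine picked (the last column is the description).
for row in conn.execute("EXPLAIN QUERY PLAN " + query):
    print(row[-1])

top = conn.execute(query).fetchall()
print(top)
```

The index helps the grouping step only. The final ORDER BY on the computed sum still has to sort one row per item.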

Finally, with this query structure, you need a filesort for the final ORDER BY. Whether this is efficient depends on the number of items you have. If each item has, on average, only one or two votes, there are almost as many groups to sort as there are rows, so this may take some time. If you have a fixed set of items and there are only a few hundred or a few thousand of them, then this should not be a performance bottleneck, even as the data size expands.

If this summary is really something you need quickly, then a trigger with a summary table (as explained in another answer) provides a faster retrieval method.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow