Question

I have a table sample with two columns, id and cnt, and another table PostTags with two columns, postid and tagid.

I want to update all cnt values with their corresponding counts and I have written the following query:

UPDATE sample SET
cnt = (SELECT COUNT(tagid) 
       FROM PostTags 
       WHERE sample.postid = PostTags.postid 
       GROUP BY PostTags.postid)

I intend to update the entire column at once, and this query seems to accomplish that. But performance-wise, is this the best way, or is there a better one?

EDIT

I've been running this query (without GROUP BY) for over 1 hour for ~18m records. I'm looking for a query that is better in performance.

Solution

That query should not take an hour. I just did a test, running a query like yours on a table of 87520 keywords and matching rows in a many-to-many table of 2776445 movie_keyword rows. In my test, it took 32 seconds.

The crucial part that you're probably missing is that you must have an index on the lookup column, which is PostTags.postid in your example.
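
To check this on the tables from the question, something like the following should work. The index name idx_posttags_postid is only a placeholder chosen for illustration; if SHOW INDEX already lists an index whose first column is postid, the ALTER TABLE is not needed.

-- List the existing indexes on the lookup table
SHOW INDEX FROM PostTags;

-- Add a single-column index on the lookup column (index name is hypothetical)
ALTER TABLE PostTags ADD INDEX idx_posttags_postid (postid);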

Here's the EXPLAIN from my test (finally we can do EXPLAIN on UPDATE statements in MySQL 5.6):

mysql> explain update kc1 set count = 
  (select count(*) from movie_keyword 
   where kc1.keyword_id = movie_keyword.keyword_id) \G
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: kc1
         type: index
possible_keys: NULL
          key: PRIMARY
      key_len: 4
          ref: NULL
         rows: 98867
        Extra: Using temporary
*************************** 2. row ***************************
           id: 2
  select_type: DEPENDENT SUBQUERY
        table: movie_keyword
         type: ref
possible_keys: k_m
          key: k_m
      key_len: 4
          ref: imdb.kc1.keyword_id
         rows: 17
        Extra: Using index

Having an index on keyword_id is important. In my case, I had a compound index, but a single-column index would help too.

CREATE TABLE `movie_keyword` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `movie_id` int(11) NOT NULL,
  `keyword_id` int(11) NOT NULL,
  PRIMARY KEY (`id`),
  KEY `k_m` (`keyword_id`,`movie_id`)
);

The difference between COUNT(*) and COUNT(movie_id) should be immaterial, assuming movie_id is declared NOT NULL. But I use COUNT(*) because the query stays index-only even if the index is defined only on the keyword_id column.
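
Applied to the tables in the question, the statement would look roughly like this. It is a sketch that assumes PostTags has an index starting with postid and that sample has a postid column, as the query in the question implies. On MySQL 5.6 or later, the EXPLAIN prefix shows whether the subquery uses the index; remove it to actually run the update.

-- Show the plan first; drop the EXPLAIN line to execute the update
EXPLAIN
UPDATE sample
SET cnt = (SELECT COUNT(*)
           FROM PostTags
           WHERE PostTags.postid = sample.postid);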

OTHER TIPS

Remove the unnecessary GROUP BY and the statement looks good. If, however, you expect many sample.cnt values to already be correct, then you would update many records that need no update. This may create some overhead (larger rollback segments, triggers executed, etc.) and thus take longer.

In order to update only the records that need to be updated, use a join:

UPDATE sample
INNER JOIN 
(
  SELECT postid, COUNT(tagid) as cnt
  FROM PostTags 
  GROUP BY postid
) tags ON tags.postid = sample.postid
SET sample.cnt = tags.cnt
WHERE sample.cnt != tags.cnt OR sample.cnt IS NULL;

Here is the SQL fiddle: http://sqlfiddle.com/#!2/d5e88.
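
One thing to keep in mind: with the INNER JOIN, rows in sample that have no matching rows in PostTags are not touched at all, whereas the correlated subquery (without GROUP BY) sets them to 0. If you want that behaviour too, a LEFT JOIN variant along these lines should do it. This is an untested sketch using the same table and column names as above:

UPDATE sample
LEFT JOIN
(
  SELECT postid, COUNT(tagid) AS cnt
  FROM PostTags
  GROUP BY postid
) tags ON tags.postid = sample.postid
-- Posts without any tags have no row in the derived table, so fall back to 0
SET sample.cnt = COALESCE(tags.cnt, 0)
WHERE sample.cnt IS NULL OR sample.cnt != COALESCE(tags.cnt, 0);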

Licensed under: CC-BY-SA with attribution