Problem

I would like to understand why the two queries below return different results.

I have a table called active_transfert where I log image transfers:

user_id | image_id | created_at
--------|----------|-----------
1       |1         |2014-07-10
1       |2         |2015-01-21
2       |1         |2015-05-23
3       |1         |2016-07-22
4       |6         |2017-06-01
4       |6         |2014-08-22

I would like to return each unique (user_id, image_id) pair once.

SELECT user_id,
       image_id
FROM active_transfert
GROUP BY user_id,
         image_id; -- returns 50 rows


SELECT user_id,
       image_id
FROM
  (SELECT user_id,
          image_id,
          rank() OVER (PARTITION BY user_id, image_id
                       ORDER BY created_at DESC) AS i_ranked
   FROM active_transfert) AS i
WHERE i.i_ranked = 1; -- returns 53 rows

I run these queries against Redshift. Why doesn't my second query prevent duplicate records (same user_id and image_id)?

Expected result:

user_id | image_id |
--------|----------|
1       |1         |
1       |2         |
2       |1         |
3       |1         |
4       |6         |

Solution

RANK() is a deterministic function: rows that tie on the ORDER BY expression within a partition always receive the same rank value. The output of your queries suggests that there are multiple records with the same user_id and image_id that also have the same created_at value. These records will all be returned with the same RANK() value.

If you run your inner query on its own, you will see these duplicates, where all three attributes are the same. If that created_at is also the largest value for its combination of user_id and image_id, every one of those rows gets a RANK() value of 1 and passes the outer filter.
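One way to confirm this is to look for exact duplicates directly. The query below is only a sketch, assuming the table and column names from the question; the n_rows alias is introduced here for illustration.

```sql
-- List every (user_id, image_id, created_at) combination that is stored
-- more than once. n_rows is an illustrative alias, not a column in the table.
SELECT user_id,
       image_id,
       created_at,
       COUNT(*) AS n_rows
FROM active_transfert
GROUP BY user_id,
         image_id,
         created_at
HAVING COUNT(*) > 1
ORDER BY user_id,
         image_id;
```

This lists all ties, not only those on the latest created_at, so it may return more rows than the ones responsible for the extra results. Only ties on the most recent created_at of a (user_id, image_id) pair yield more than one row with rank 1.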

To get your desired output, you should use ROW_NUMBER() instead. It is a nondeterministic function when the columns in the OVER clause do not uniquely identify the rows, which is the case here. It assigns a distinct number to every row in the partition, so the rows that tied under RANK() are numbered in an arbitrary order and only one of them gets 1. Because the outer query returns only user_id and image_id, which are identical across the tied rows, it does not matter which one that is.

Your second query using ROW_NUMBER():

SELECT user_id,
       image_id
FROM
  (SELECT user_id,
          image_id,
          ROW_NUMBER() OVER (PARTITION BY user_id, image_id
                       ORDER BY created_at DESC) AS i_ranked
   FROM active_transfert) AS i
WHERE i.i_ranked = 1;
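
To see the difference side by side, here is a small self-contained sketch. The demo CTE and its rows are made up for illustration (the repeated 2017-06-01 row is a hypothetical tie, not data from the question), and the rnk and rn aliases are likewise assumptions.

```sql
-- Hypothetical sample: two rows tie on the latest created_at.
WITH demo AS (
    SELECT 4 AS user_id, 6 AS image_id, CAST('2017-06-01' AS DATE) AS created_at
    UNION ALL
    SELECT 4, 6, CAST('2017-06-01' AS DATE)   -- exact duplicate of the row above
    UNION ALL
    SELECT 4, 6, CAST('2014-08-22' AS DATE)
)
SELECT user_id,
       image_id,
       created_at,
       -- Tied rows share a rank, and the next rank skips ahead: 1, 1, 3
       RANK() OVER (PARTITION BY user_id, image_id
                    ORDER BY created_at DESC) AS rnk,
       -- Every row gets its own number: 1, 2, 3 (order among ties is arbitrary)
       ROW_NUMBER() OVER (PARTITION BY user_id, image_id
                          ORDER BY created_at DESC) AS rn
FROM demo;
```

Filtering on rnk = 1 keeps both 2017-06-01 rows, whereas filtering on rn = 1 keeps exactly one of them.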
License: CC-BY-SA with attribution
Not affiliated with dba.stackexchange