I am using a simple stored procedure to fetch some data from a database which works fine so far.

Is there a way in SQL to count how often each item appears in the results of my select and then remove the duplicates, e.g. looking at the column "url"? Basically I want to add a count to each row of my select results and then ideally remove the duplicates.

Example: My unfiltered result would be: url1, url1, url1, url2, url2, url3. What I would like to see instead is: url1 3, url2 2, url3 1.

My stored procedure:

ALTER PROCEDURE [dbo].[CountQueue]
AS
BEGIN
SET NOCOUNT ON;
SELECT      dateEsc,
            url,
            EID
FROM        QueueLog
WHERE       logStatus = 'New'
AND         region = 'US'
AND         (
                flag = 'flag1' 
                OR 
                flag = 'flag2'
            )
ORDER BY    dateEsc desc, EID desc
END

Many thanks for any help with this, Tim

Solution

You can do this in a plain query; you don't have to use a stored procedure. If I understand you correctly, you can use "group by" to solve the problem:

SELECT      url,
            count(*)
FROM        QueueLog
WHERE       logStatus = 'New'
AND         region = 'US'
AND         (
            flag = 'flag1' 
            OR 
            flag = 'flag2'
            )
GROUP BY url;
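
To try the grouped query without touching the real table, here is a minimal sketch that runs it against the example from the question. It assumes SQL Server (suggested by the [dbo] procedure above). The table variable @QueueLog, its column types, the sample rows, and the aliases urlCount and latestDateEsc are all made up for illustration.

```sql
-- Hypothetical sample data; column types and values are assumptions.
DECLARE @QueueLog TABLE
(
    EID       int          NOT NULL,
    dateEsc   datetime     NOT NULL,
    url       varchar(200) NOT NULL,
    logStatus varchar(20)  NOT NULL,
    region    varchar(10)  NOT NULL,
    flag      varchar(20)  NOT NULL
);

INSERT INTO @QueueLog (EID, dateEsc, url, logStatus, region, flag)
VALUES  (1, '20140101', 'url1', 'New', 'US', 'flag1'),
        (2, '20140102', 'url1', 'New', 'US', 'flag2'),
        (3, '20140103', 'url1', 'New', 'US', 'flag1'),
        (4, '20140104', 'url2', 'New', 'US', 'flag1'),
        (5, '20140105', 'url2', 'New', 'US', 'flag2'),
        (6, '20140106', 'url3', 'New', 'US', 'flag1');

-- One row per url, with the number of matching rows.
-- MAX(dateEsc) is optional; it keeps the most recent date per url
-- in case the result should still be sortable by date.
SELECT      url,
            COUNT(*)     AS urlCount,
            MAX(dateEsc) AS latestDateEsc
FROM        @QueueLog
WHERE       logStatus = 'New'
AND         region = 'US'
AND         flag IN ('flag1', 'flag2')   -- same as flag = 'flag1' OR flag = 'flag2'
GROUP BY    url
ORDER BY    url;
-- urlCount per url for this sample: url1 3, url2 2, url3 1
```

Columns such as EID cannot appear in the select list next to GROUP BY url unless they are wrapped in an aggregate like MAX or MIN, which is why the original three-column select has to change shape.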

If you want to get only the urls that have duplicates, you can add a HAVING clause:

SELECT      url,
            count(*)
FROM        QueueLog
WHERE       logStatus = 'New'
AND         region = 'US'
AND         (
            flag = 'flag1' 
            OR 
            flag = 'flag2'
            )
GROUP BY url
HAVING count(*) > 1;

My favorite way to delete duplicates uses window functions. Either way, to delete duplicates you have to know which duplicate you want to delete. I'm assuming you want to keep the row with the oldest dateEsc and delete the ones with a newer dateEsc. This query (or something like it) should give you all of the duplicate rows. After you've verified that they're right, it's not hard to change it from a select to a delete.

SELECT * FROM 
(
SELECT      EID,
            dateEsc,
            url,
            rank() OVER(PARTITION BY url ORDER BY dateEsc) as rank
FROM        QueueLog
WHERE       logStatus = 'New'
AND         region = 'US'
AND         (
            flag = 'flag1' 
            OR 
            flag = 'flag2'
            )
) a
WHERE a.rank > 1;

Basically, the inner query takes all rows with the same url and gives them a rank based on dateEsc. So the one with the oldest dateEsc would get a "1" in the rank column, the next oldest would get the rank 2, and so on. Then we know we want to keep the one with rank 1. The duplicates will be anything with rank 2 or higher, so we select those rows in the outer query. If you want to change which entry is the "correct one", just change rank() OVER(PARTITION BY url ORDER BY dateEsc) as rank to rank() OVER(PARTITION BY url ORDER BY EID) as rank or such.
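
Changing the select into a delete could look roughly like the sketch below. It assumes SQL Server, where a DELETE can target a common table expression, and it swaps rank() for ROW_NUMBER(): rank() gives rows with the same dateEsc the same rank, so tied duplicates would all get rank 1 and survive the delete. EID is added as a tie-breaker on the assumption that it is unique per row. The names ranked, rn, and remaining are made up for illustration.

```sql
BEGIN TRANSACTION;

-- Number the rows per url; the oldest dateEsc gets rn = 1.
WITH ranked AS
(
    SELECT      EID,
                dateEsc,
                url,
                ROW_NUMBER() OVER (PARTITION BY url ORDER BY dateEsc, EID) AS rn
    FROM        QueueLog
    WHERE       logStatus = 'New'
    AND         region = 'US'
    AND         (
                flag = 'flag1'
                OR
                flag = 'flag2'
                )
)
-- Deleting through the CTE removes the underlying QueueLog rows.
DELETE FROM ranked
WHERE       rn > 1;

-- Check what is left: every url should now appear exactly once.
SELECT      url,
            COUNT(*) AS remaining
FROM        QueueLog
WHERE       logStatus = 'New'
AND         region = 'US'
AND         (
            flag = 'flag1'
            OR
            flag = 'flag2'
            )
GROUP BY    url;

-- Undo while testing; switch to COMMIT TRANSACTION once the result looks right.
ROLLBACK TRANSACTION;
```

Running it inside a transaction that ends in a rollback lets you inspect the outcome before anything is permanently removed.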

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow