Question

Was wondering if anyone knew an efficient way to dedupe records in a large dataset using Oracle SQL based on the max values of 2 attributes in conjunction.

In the hypothetical example below, I am looking remove all duplicate COMPANYID / CHILD ID Pairs by selecting first the maximum transactionid. Where the payload ID still has duplicates, the maximum BATCHID.

note: transactionID and batchID may have null values (which would be expected to the lowest value)

Table: Transaction

<p> CompanyID| ChildID | transactionid| BatchID | Product Details </P>
<p> ABC         EFG       306                    Product1 </p>
<p>ABC         EFG       306          54        Product2</p>
<p>ZXY         BFG       405          003       Product1</p>
<p>ZXY         BFG       405          004       Product2</p>
<p>ZXY         BFG       407                    Product3</p>

Expected Result:

<p>ABC | EFG | 306 | 54 | Product 2  --selected on basis of highest transactionid and batchid </P>
<p>ZXY | BFG | 405 | 407 | Product 3 --selected on basis of highest transactionid </p>

I envisioned simply: 1) Using a max function on the transactionid and subquerying the result to max the batchID in addition 2) Self joining the "de-duped' set to the original set to obtain product information

Does anybody know of a more efficient / cleaner way to achieve this and a way to handle the nulls better?

Appreciate any feedback.

Was it helpful?

Solution

From Oracle 11g, you can use this kind of requests:

with w(CompanyID, ChildID, transactionid, BatchID, Product_Details) as
(
  select 'ABC', 'EFG', 306, null, 'Product1 ' from dual
  union all
  select 'ABC', 'EFG', 306, 54, 'Product2' from dual
  union all
  select 'ZXY', 'BFG', 405, 003, 'Product1' from dual
  union all
  select 'ZXY', 'BFG', 405, 004, 'Product2' from dual
  union all
  select 'ZXY', 'BFG', 407, null, 'Product3' from dual
)
select w.CompanyID,
       w.ChildID,
       max(w.transactionid)   keep (dense_rank last order by nvl(w.transactionid, 0), nvl(w.batchid, 0)) max_transactionid,
       max(w.batchid)         keep (dense_rank last order by nvl(w.transactionid, 0), nvl(w.batchid, 0)) max_batchid,
       max(w.Product_Details) keep (dense_rank last order by nvl(w.transactionid, 0), nvl(w.batchid, 0)) max_Product_Details
from w
group by w.CompanyID, w.ChildID
;

The nvl function allows you to handle null cases. Here is the output (which does not fit yours, but I did the request as I understood what you wanted):

COMPANYID    CHILDID    MAX_TRANSACTIONID    MAX_BATCHID    MAX_PRODUCT_DETAILS
ABC          EFG        306                  54             Product2
ZXY          BFG        407                                 Product3

EDIT: Let me try to explain further DENSE_RANK and LAST: inside a GROUP BY, this syntax appears as an aggregate function (like SUM, AVG...).

  • In a group, the ORDER BY gives the sorting (here, transactionid and batchid)
  • then the DENSE_RANK LAST states that you will focus on the last ranked row(s) of this sorting (you can have indeed several rows with same rank)
  • the MAX takes the maximum value inside these top-ranked rows. Most of the time, you only have one row so MAX can appear useless, but it is not. So you will often see MIN and DENSE_RANK FIRST, or MAX and DENSE_RANK LAST.

Here is the Oracle doc on this subject.

OTHER TIPS

Because you are dealing with multiple columns, you should also consider just using row_number():

select t.*
from (select t.*,
             row_number() over (partition by CompanyId, ChildId
                                order by transactionid desc nulls last, BatchID desc nulls last
                               ) as seqnum
      from t
     ) t
where seqnum = 1;

The keep/dense_rank method is fast. I'm not sure if doing it multiple times is faster than using row_number(). Testing can give you this information.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top