Question

This is related to a question I asked previously, for which LAG/LEAD was suggested. However, the data I'm working with is more complex than I first thought, so I need a more robust solution. This screenshot shows the issue I need to tackle:

[screenshot of the sample records]

Within a single serial number, a shipment event defines a new reference window. So records 2, 3 and 4 relate to record 1, record 6 relates to record 5, and so forth. I need to mark the records for which the BillToId doesn't match that of the parent shipment.

I'm trying to understand whether I could even use the LAG function to compare records 2, 3 and 4 back to record 1 when the number of post-shipment events varies (duplicates are allowed). I was thinking I might be better off with another fact table that first identifies the parent rowid alongside each record.

So my question becomes: how do I efficiently identify which shipment each row belongs to? Am I forced to run a subquery for each record? Right now I'm working with over 2 million rows in total. Later I would make this query part of the ETL process, so it would be processing smaller chunks of data.


Solution

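Here is an approach that uses the cumulative sum functionality in SQL Server (a windowed SUM with ORDER BY, available from SQL Server 2012). The idea is to give each "Ship" activity a value of 1 and every other activity a value of 0. A cumulative sum of that flag, ordered by rowid within each serial number, then increases by one at every shipment, so a shipment and all the events that follow it share the same running total. That running total identifies each group of rows that should have the same billtoid. After that, the shipment's billtoid can be assigned to every record in the same group and compared against each row's own value: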

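select rowid, dateid, billtoid, activitytypeid, serialnumber
from (select t.*,
             -- copy the shipment's billtoid onto every row in its group
             max(case when activitytypeid = 'Ship' then billtoid end) over
                  (partition by serialnumber, cumships) as ship_billtoid
      from (select t.*,
                   -- running count of shipments: rises by 1 at each 'Ship' row
                   sum(case when activitytypeid = 'Ship' then 1 else 0 end) over
                       (partition by serialnumber order by rowid) as cumships
            from t
           ) t
     ) t
-- keep only rows whose billtoid differs from the parent shipment's
where billtoid <> ship_billtoid;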
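
To see how the grouping works, and to answer the part of the question about identifying the parent row, here is a minimal sketch that runs the same two window functions over a handful of made-up rows. The sample values, the 'Repair' and 'Return' activity names and the ship_rowid column are assumptions for illustration only. The column names follow the query above, and treating rowid as the ordering column is carried over from it.

```sql
-- Hypothetical sample data: the values below are invented for illustration.
with t (rowid, dateid, billtoid, activitytypeid, serialnumber) as (
    select *
    from (values
            (1, 20140101, 100, 'Ship',   'A1'),
            (2, 20140105, 100, 'Repair', 'A1'),
            (3, 20140110, 200, 'Repair', 'A1'),  -- billtoid differs from shipment 1
            (4, 20140112, 100, 'Return', 'A1'),
            (5, 20140201, 300, 'Ship',   'A1'),
            (6, 20140210, 300, 'Repair', 'A1')
         ) v (rowid, dateid, billtoid, activitytypeid, serialnumber)
),
grouped as (
    -- running count of shipments per serial number
    select t.*,
           sum(case when activitytypeid = 'Ship' then 1 else 0 end) over
               (partition by serialnumber order by rowid) as cumships
    from t
)
select rowid, activitytypeid, billtoid, cumships,
       -- rowid of the shipment that opened this group (the "parent" row)
       max(case when activitytypeid = 'Ship' then rowid end) over
           (partition by serialnumber, cumships) as ship_rowid,
       -- billtoid of that same shipment
       max(case when activitytypeid = 'Ship' then billtoid end) over
           (partition by serialnumber, cumships) as ship_billtoid
from grouped
order by serialnumber, rowid;
```

With this sample data, rows 1 to 4 get cumships = 1 and ship_rowid = 1, and rows 5 and 6 get cumships = 2 and ship_rowid = 5. Only row 3 has a billtoid (200) that differs from its ship_billtoid (100), so it is the only row the filtered query would return. Note that any rows that occur before the first shipment of a serial number get cumships = 0 and a NULL ship_billtoid, so the `<>` comparison silently excludes them. If those rows should be flagged as well, they need an explicit `is null` check.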
Licensed under: CC-BY-SA with attribution