Matching only one specific row in a JOIN where many exist

https://stackoverflow.com/questions/1472528

16-09-2019
|

Question

(Advantage Database Server) I have a table of service providers that, for auditing purposes, are never deleted. They have a start date and end date; in the case of changes like name or address, the existing row is end dated, a new row is created, and a new start date is assigned for the changed data.

During processing of payments to those providers, I need a summary page that lists the provider name, address, identifier (ProvID), and total amount being paid. This is done in a fairly straightforward query with a SUM() and GROUP BY.

The problem appears when there are two or more rows for a specified provider identifier. I end up with duplicate rows (which could result in multiple payments to that provider if not caught).

My first thought was to use something (ugly, but performs reasonably quickly) like a subselect:

SELECT ... FROM service s
INNER JOIN provider p ON p.ProvID = s.ProvID
AND (p.EndDate IS NULL or p.EndDate = (SELECT Max(EndDate) FROM
   provider lu WHERE lu.ProvID = s.ProvID))

Unfortunately, this still ended up finding two rows; one row for the NULL EndDate and one for the MAX(EndDate).

I handle this in other cases (eg., locating the proper ProvID for a service provided on a specific date) using

p.EndDate is null or (s.ServiceDate BETWEEN p.StartDate AND p.EndDate)

Unfortunately, since the problem query is a GROUP BY with an aggregate, the service date isn't available.

Any suggestions?

EDIT: What I'm looking for is either the row with the NULL EndDate if it exists, OR the row with the Max(EndDate) if the NULL row doesn't exist. This covers the case, for instance, where a supplier was terminated yesterday, but did work last week, and we'll be paying them next week.

Solution

So I guess if there is a row with NULL end date, you want that one, otherwise you want the one with the largest end date?

I'm not sure about ADS, but the following would work on SQL Server:

SELECT ... FROM service s
INNER JOIN provider p ON p.ProvID = s.ProvID
AND (COALESCE(p.EndDate, '2037-01-01') = (
   SELECT Max(COALESCE(EndDate, '2037-01-01')) FROM
   provider lu WHERE lu.ProvID = s.ProvID)
)

The COALESCE operator returns the first non-null parameter, so this is basically just setting the nulls to a time far in the future, so that SELECT MAX will give you the one with the NULL end date if there is one.

OTHER TIPS

in the 2nd condition, you have to get the max only if there is no NULL EndDate

SELECT ... FROM service s
INNER JOIN provider p ON p.ProvID = s.ProvID
AND (   p.EndDate IS NULL 
     or (p.EndDate = (SELECT Max(EndDate) 
                        FROM provider lu 
                       WHERE lu.ProvID = s.ProvID)
         AND NOT EXISTS (SELECT NULL 
                           FROM provider lu 
                          WHERE lu.ProvID = s.ProvID 
                            AND lu.EndDate IS NULL)
        )
    )

Perhaps use a subquery in place of the second table:

SELECT ... FROM service s
INNER JOIN (SELECT ..., Max(EndDate) FROM
   provider lu WHERE lu.ProvID = s.ProvID GROUP BY ...) p ON p.ProvID = s.ProvID

This is assuming you'd get NULL back if there is no max enddate.

What you are referring to is a Type-2 Dimension of a Data Warehouse.

You have to join by the ID and by the StartDate and EndDate to get the proper data.

OTTOMH code

SELECT TransactionId, TransactionType
FROM TransactionList Tx
    INNER JOIN TransactionType TxType
        ON Tx.TransactionTypeId = TxType.TxTypeId
        AND Tx.TransactionDate Between TxType.StartDate and TxType.EndDate

What in your provider table denotes the current date? EndDate=NULL, EndDate=Max(EndDate) or EndDate='9999-01-01'? All three are valid choices, but this should really be unambiguous, since if it's not you're going to end up with duplicate rows in queries all the time, no matter how cleverly you craft this particular query. So i suggest fixing that in the provider table, and then something like this should work:

select p.name, p.address, p.id, sum(s.amount)
  from provider p
  join service s on p.id=s.provider_id
where p.endDate is NULL
group by p.name, p.address, p.id

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow