In TSQL, how can I merge multiple sets of user data to get a data set with one row per user containing all their data?

https://stackoverflow.com/questions/15237281

18-03-2022

Question

Suppose I have ten tables of user data (in 10 different systems), and any user could have a record in zero or more of those 10 tables.

What would a join query look like that merges the data from all 10 tables into a result set with one row per user and columns from all 10 tables? For each user, the columns from tables that have no record for that user would be null.

The problem I'm encountering is that after performing a full join on table 1 (t1) and table 2 (t2) on username, I'm left with a result set where either t1.Username or t2.Username could be null (for example, if a user had a record only in t1 but not in t2, or vice versa). Since either username could be null, which username field should I join t3 (and subsequent tables) on, without writing a complicated "on" clause with "or" conditions?

I'm afraid the only way to do this cleanly is to COALESCE the usernames after each join, and join each subsequent table to the coalesced username field of the preceding result set.

My first select would look like this:

select coalesce(t1.username,t2.username) as U2, t1.*, t2.*
from t1 full outer join t2 on t1.Username = t2.Username

I would then have to join t3 to that result set on t3.Username = U2. But then, before I could join t4, I would have to coalesce t3.Username and U2 to get U3, so I could join t4.Username on U3, and so on. That would seem to require that the first select statement be a subquery within a query that selects a new coalesced username, and so on for each additional table. The final form of the query would seem to necessarily be a nested series of subqueries. Is that how it must be, or is there another way to do this?
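To make the nested form described above concrete, here is a minimal sketch for three tables. The table variables @t1, @t2, @t3, their data columns A, B, C, and the sample rows are hypothetical stand-ins for the real sets. Note that a derived table needs uniquely named columns, so the inner select lists its columns explicitly instead of using t1.*, t2.*.

```sql
-- Hypothetical stand-ins for the real tables: one Username column plus one data column each.
DECLARE @t1 TABLE (Username nvarchar(128) NOT NULL, A int NULL);
DECLARE @t2 TABLE (Username nvarchar(128) NOT NULL, B int NULL);
DECLARE @t3 TABLE (Username nvarchar(128) NOT NULL, C int NULL);

-- Made-up sample rows, for illustration only.
INSERT INTO @t1 (Username, A) VALUES (N'alice', 1), (N'bob', 2);
INSERT INTO @t2 (Username, B) VALUES (N'alice', 10);
INSERT INTO @t3 (Username, C) VALUES (N'carol', 100);

-- Each nesting level coalesces the usernames seen so far (U2, then U3),
-- and the next table joins on that coalesced value.
SELECT COALESCE(j2.U2, t3.Username) AS U3,
       j2.A, j2.B, t3.C
FROM (
    SELECT COALESCE(t1.Username, t2.Username) AS U2,
           t1.A, t2.B
    FROM @t1 AS t1
    FULL OUTER JOIN @t2 AS t2 ON t1.Username = t2.Username
) AS j2
FULL OUTER JOIN @t3 AS t3 ON t3.Username = j2.U2
ORDER BY U3;
```

A fourth table would wrap this whole statement in another derived table, which is the nesting the question is trying to avoid.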

What I don't want to do is a series of left joins against a unique username list that I generate up front by unioning the usernames of all 10 tables. While that would work and is a very clean single-level query, the generation of each of those 10 tables is expensive, so I don't want to be generating them up front just to pull the unique usernames like the answer here: https://stackoverflow.com/a/9233478/88409

I also saw another discussion here: http://www.listserv.uga.edu/cgi-bin/wa?A2=ind1110b&L=sas-l&P=1445 which shows 4 different versions of how it could be done (the first of which avoids subqueries), but it uses a coalesce in the where clause that increases in size with each additional table.
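The linked thread is not reproduced here, but a similar flat variant in T-SQL places the growing COALESCE in each ON clause rather than in the where clause. The sketch below assumes hypothetical table variables @t1 through @t4 with made-up columns and rows; it is an illustration of the idea, not the code from that discussion.

```sql
-- Hypothetical stand-ins for four of the user tables.
DECLARE @t1 TABLE (Username nvarchar(128) NOT NULL, A int NULL);
DECLARE @t2 TABLE (Username nvarchar(128) NOT NULL, B int NULL);
DECLARE @t3 TABLE (Username nvarchar(128) NOT NULL, C int NULL);
DECLARE @t4 TABLE (Username nvarchar(128) NOT NULL, D int NULL);

-- Made-up sample rows, for illustration only.
INSERT INTO @t1 (Username, A) VALUES (N'alice', 1);
INSERT INTO @t2 (Username, B) VALUES (N'bob', 2);
INSERT INTO @t3 (Username, C) VALUES (N'alice', 3), (N'carol', 4);
INSERT INTO @t4 (Username, D) VALUES (N'bob', 5);

-- Single-level query: each join matches the new table against whichever
-- earlier username is non-null, so the COALESCE grows by one argument per table.
SELECT COALESCE(t1.Username, t2.Username, t3.Username, t4.Username) AS Username,
       t1.A, t2.B, t3.C, t4.D
FROM @t1 AS t1
FULL OUTER JOIN @t2 AS t2
    ON t2.Username = t1.Username
FULL OUTER JOIN @t3 AS t3
    ON t3.Username = COALESCE(t1.Username, t2.Username)
FULL OUTER JOIN @t4 AS t4
    ON t4.Username = COALESCE(t1.Username, t2.Username, t3.Username)
ORDER BY Username;
```

This avoids subqueries, at the cost of join conditions that get longer with every table and are unlikely to use an index on Username.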

Solution

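Generate each of the 10 sets into its own table variable (basically a cache), so each expensive set is computed only once. Then apply the technique from the answer linked in the question to the table variables: union their usernames to get the distinct user list, and left join each table variable to that list. I recommend declaring a primary key on each of the table variables to make merging them quicker.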
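The approach above can be sketched roughly as follows, shown for three sets instead of ten. The table variable names (@s1, @s2, @s3), the data columns A, B, C, and the sample rows are all hypothetical; in practice each INSERT would run the expensive query for that system. The primary key on Username assumes each set holds at most one row per user.

```sql
-- One table variable per source set, keyed on Username (assumes one row per user per set).
DECLARE @s1 TABLE (Username nvarchar(128) NOT NULL PRIMARY KEY, A int NULL);
DECLARE @s2 TABLE (Username nvarchar(128) NOT NULL PRIMARY KEY, B int NULL);
DECLARE @s3 TABLE (Username nvarchar(128) NOT NULL PRIMARY KEY, C int NULL);

-- Made-up sample rows standing in for the expensive queries, each of which runs only once.
INSERT INTO @s1 (Username, A) VALUES (N'alice', 1), (N'bob', 2);
INSERT INTO @s2 (Username, B) VALUES (N'alice', 10);
INSERT INTO @s3 (Username, C) VALUES (N'carol', 100);

-- UNION (not UNION ALL) removes duplicates, giving one row per distinct username;
-- every cached set is then left joined to that list.
SELECT u.Username, s1.A, s2.B, s3.C
FROM (
    SELECT Username FROM @s1
    UNION
    SELECT Username FROM @s2
    UNION
    SELECT Username FROM @s3
) AS u
LEFT JOIN @s1 AS s1 ON s1.Username = u.Username
LEFT JOIN @s2 AS s2 ON s2.Username = u.Username
LEFT JOIN @s3 AS s3 ON s3.Username = u.Username
ORDER BY u.Username;
```

With this sample data the result has one row each for alice, bob, and carol, with NULL in the columns of any set that has no record for that user. Temp tables (#s1, and so on) would work the same way if the sets are large enough that statistics matter.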

Licensed under: CC-BY-SA with attribution
