An efficient way to compare two large data sets in SQL
11-12-2019
Question
Currently, I'm comparing two data sets that contain unique StoreKey/ProductKey combinations.
The 1st data set has the unique StoreKey/ProductKey combinations for sales from the beginning of January 2012 to the end of May 2014 (result = 450K lines). The 2nd data set has the unique StoreKey/ProductKey combinations for sales from the beginning of June 2014 until today (result = 190K lines).
I'm looking to find the StoreKey/ProductKey combinations that are in the 2nd set but not in the 1st set - i.e. new products sold from the beginning of June.
Up until now, I've dumped the two data sets into temp tables, created indexes on both keys for both tables, and used the EXCEPT statement to find the unique items.
What is the most efficient way of comparing such large data sets? Is there a more efficient way of doing this type of large comparison?
Solution
Using EXCEPT is, in my opinion, the way to go here, but you might want to reconsider the use of the temporary tables. By creating them you are effectively duplicating your data, which will slow you down. If the indexes you need already exist on the source tables (as I suspect they do), just compare the appropriate SELECTs:
-- The first SELECT is the set you keep; EXCEPT removes any rows also found in the second.
SELECT StoreKey, ProductKey FROM sales WHERE SaleDate BETWEEN @date1 AND @date2
EXCEPT
SELECT StoreKey, ProductKey FROM sales WHERE SaleDate BETWEEN @date3 AND @date4
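Applied to the question's date ranges, that becomes something like the following (a sketch, assuming a hypothetical `sales` table with a `SaleDate` column; the set you want to keep goes before the EXCEPT):

```sql
-- StoreKey/ProductKey pairs sold since June 2014 that never appeared before
SELECT StoreKey, ProductKey
FROM   sales
WHERE  SaleDate >= '20140601'           -- 2nd set: June 2014 until today
EXCEPT                                  -- keep rows not found in the query below
SELECT StoreKey, ProductKey
FROM   sales
WHERE  SaleDate >= '20120101'
  AND  SaleDate <  '20140601';          -- 1st set: January 2012 to end of May 2014
```

Note that EXCEPT already returns distinct rows, so there is no need to de-duplicate the inputs first.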
OTHER TIPS
If you're familiar with algorithms (Big-O complexity), performing this comparison is at best O(n log n). The most efficient algorithm sorts both data sets, then makes a merged pass down both in parallel to find matching (or unmatched) keys. Most RDBMS optimizers will do this automatically for you when you use EXCEPT or MINUS. Your explain plan will confirm or disconfirm it. If you see nested loops, you're doing O(n^2) work, which is less efficient.
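One way to check which plan the optimizer chose is to request the plan instead of executing the query. A sketch, assuming SQL Server and the same hypothetical `sales` table (other engines use EXPLAIN instead):

```sql
-- Ask SQL Server to return the estimated plan rather than run the query.
SET SHOWPLAN_ALL ON;
GO
SELECT StoreKey, ProductKey FROM sales WHERE SaleDate >= '20140601'
EXCEPT
SELECT StoreKey, ProductKey FROM sales WHERE SaleDate <  '20140601';
GO
SET SHOWPLAN_ALL OFF;
GO
-- In the plan output, look for a Merge Join (sorted inputs, O(n log n))
-- or a Hash Match; a Nested Loops operator over both full sets suggests
-- the O(n^2) behavior described above.
```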