Question

Currently, I'm comparing two data sets that contain unique StoreKey/ProductKey combinations.

The first data set has the unique StoreKey/ProductKey combinations for sales from the beginning of January 2012 through the end of May 2014 (result = 450K lines). The second data set has the unique StoreKey/ProductKey combinations for sales from the beginning of June 2014 until today (result = 190K lines).

I'm looking to find the StoreKey/ProductKey combinations that are in the second set but not in the first, i.e. new products sold from the beginning of June 2014.

Up until now, I've dumped the two data sets into temp tables, created indexes on both keys for each table, and used EXCEPT to find the new combinations.

What is the most efficient way of comparing such large data sets? Is there a more efficient way of doing this type of large comparison?


Solution

Using EXCEPT is, in my opinion, the way to go here, but you might want to reconsider the use of the temporary tables. By dumping the data into them you are effectively duplicating it, which will slow you down. If the indexes you need already exist on the source tables (as I suspect they do), just compare the appropriate SELECTs:

-- EXCEPT returns the rows of the first SELECT that are not in the second,
-- so the newer period (June 2014 until today) goes first
SELECT StoreKey, ProductKey FROM table WHERE sales BETWEEN date3 AND date4   -- 2014-06-01 .. today
EXCEPT
SELECT StoreKey, ProductKey FROM table WHERE sales BETWEEN date1 AND date2   -- 2012-01-01 .. 2014-05-31
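
For the indexes mentioned above, a covering index on the date column that includes both keys lets each SELECT be answered from the index alone, without touching the base table. A minimal sketch, assuming SQL Server syntax; the table, column and index names below are placeholders matching the query above, not your actual schema:

-- covering index: seek on the date range, read StoreKey/ProductKey straight from the index
CREATE NONCLUSTERED INDEX IX_sales_date_keys
    ON [table] (sales)
    INCLUDE (StoreKey, ProductKey);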

OTHER TIPS

If you're familiar with algorithms (Big-O complexity), performing this comparison is at best O(n log n). The most efficient approach sorts both data sets and then makes a single merged pass down them in parallel to find matching (or unmatched) keys. Most RDBMS optimizers will do this automatically for you when you use EXCEPT or MINUS; your explain plan will confirm or disprove it. If you see nested loops, you're doing O(n^2), which is far less efficient.
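
To inspect that plan, most engines can display it without executing the query. A minimal sketch, assuming SQL Server (PostgreSQL and Oracle expose the same information via EXPLAIN and EXPLAIN PLAN); the table and column names are the same placeholders used above:

-- show the estimated plan as text instead of running the query
SET SHOWPLAN_TEXT ON;
GO
SELECT StoreKey, ProductKey FROM [table] WHERE sales BETWEEN date3 AND date4
EXCEPT
SELECT StoreKey, ProductKey FROM [table] WHERE sales BETWEEN date1 AND date2;
GO
SET SHOWPLAN_TEXT OFF;
GO

In the output, a Merge Join (or a Hash Match) over two index scans is the sorted, single-pass behaviour described above; Nested Loops over large inputs is the O(n^2) case to avoid.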

Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange