Question

I am curious which of the following has better performance. A little bit about the context: I am processing a CSV file whose rows need to be inserted into a database table, and I need to weed out duplicates. There are two strategies for this:

a. Check each row against the DB to see whether one of its columns is a duplicate
b. Collect all rows and then check which of them are duplicates

Essentially for a.

SELECT count(*) FROM table WHERE UniqueColumn = $uniqueColumnFromCSV

and for b:

SELECT UniqueColumn FROM table 
 WHERE UniqueColumn in ($uniqueColumn1FromCSV,$uniq....,$uniqueColumn2FromCSV);

The above will give me an array of the emails that are already present in the DB table, which I can use to filter out those emails from my $csvLines[].
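As a sketch of that filtering step (not from the original post; the `email` key and sample values are illustrative, and `existing` stands in for the result set of query b):

```python
# Filter out CSV rows whose unique column already exists in the table.
# `existing` represents the emails returned by the SELECT ... IN (...) query.
def filter_new_rows(csv_lines, existing):
    existing_set = set(existing)  # set gives O(1) membership tests
    return [row for row in csv_lines if row["email"] not in existing_set]

csv_lines = [{"email": "a@x.com"}, {"email": "b@x.com"}, {"email": "c@x.com"}]
existing = ["b@x.com"]  # already in the table
print(filter_new_rows(csv_lines, existing))
# → [{'email': 'a@x.com'}, {'email': 'c@x.com'}]
```

Building a set once and testing membership per row keeps the client-side filtering linear in the number of CSV lines.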

I am in favor of b, since it makes one DB call and does not stall the CSV-reading mechanism by processing each line. Then again, since the second query checks the entire DB table for the existence of multiple records, I am not quite convinced.

For the sake of discussion, we can ignore the CSV part. What I am really interested in is the performance of making 1000 calls to the DB to check whether a uniqueColumn value is present vs making one call to the DB to check which of the uniqueColumn values are duplicates.


Solution

I am in favor of b, since it makes 1 DB call

Your instincts are correct.

Then again, since the 2nd call is checking the entire DB table for the existence of multiple records, I am not quite convinced.

Both methods are searching for exactly the same rows, so there is no difference there¹. The difference is that with the first method the price of a database round-trip is paid once per row, while with the second method it is paid only once, regardless of the number of rows.

What I am really interested in knowing is the performance of making 1000 calls to the DB to check if a uniqueColumn value is present vs making 1 call to the DB to check which of the uniqueColumns are duplicates.

I suggest you measure to get precise results, but I would expect the one "big" query to be significantly faster than the 1000 "small" queries.
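A minimal way to measure this yourself, sketched here with Python's standard `sqlite3` module (table and column names are made up; note SQLite runs in-process, so there are no network round-trips, and the gap on a real client/server DBMS would typically be even larger):

```python
import sqlite3
import time

# Toy benchmark: many single lookups vs one IN (...) query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (uniquecol TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO t VALUES (?)",
                 [(f"user{i}@example.com",) for i in range(10_000)])

# 500 candidate values (kept under older SQLite's default 999-parameter limit).
candidates = [f"user{i}@example.com" for i in range(0, 1000, 2)]

# Strategy a: one query per candidate value.
start = time.perf_counter()
hits_a = sum(
    conn.execute("SELECT count(*) FROM t WHERE uniquecol = ?", (c,)).fetchone()[0]
    for c in candidates
)
t_many = time.perf_counter() - start

# Strategy b: a single query with an IN (...) list.
start = time.perf_counter()
placeholders = ",".join("?" * len(candidates))
rows = conn.execute(
    f"SELECT uniquecol FROM t WHERE uniquecol IN ({placeholders})", candidates
).fetchall()
t_one = time.perf_counter() - start

print(f"per-row queries: {t_many:.4f}s, single IN query: {t_one:.4f}s")
```

Both strategies find the same rows; only the timings differ.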


¹ Actually, there might be a difference, in favor of the big query, if your DBMS can parallelize the query execution.

OTHER TIPS

RDBMSes are optimized for set operations, so IMHO it is almost always better (faster) to make one call that processes the whole data set than to make 1000 separate calls.
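One practical caveat with the single-call approach: databases cap the number of bound parameters (for example, older SQLite builds default to 999), so for very large CSVs the IN (...) check is often split into a handful of chunks. A sketch, again using `sqlite3` with illustrative table and column names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (uniquecol TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO t VALUES (?)", [("a@x.com",), ("b@x.com",)])

def chunked(values, size):
    """Yield successive slices of `values` of at most `size` elements."""
    for i in range(0, len(values), size):
        yield values[i:i + size]

def find_existing(conn, values, chunk_size=500):
    """Return the subset of `values` already present in t.uniquecol,
    issuing one IN (...) query per chunk to stay under parameter limits."""
    found = set()
    for chunk in chunked(values, chunk_size):
        marks = ",".join("?" * len(chunk))
        cur = conn.execute(
            f"SELECT uniquecol FROM t WHERE uniquecol IN ({marks})", chunk
        )
        found.update(row[0] for row in cur)
    return found

print(sorted(find_existing(conn, ["a@x.com", "c@x.com", "b@x.com"])))
# → ['a@x.com', 'b@x.com']
```

Even chunked into a few batches, this stays a set operation on the server side, which is the property the answer above is relying on.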

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow