Question

Within a for loop, I need to check whether a specific piece of data exists in table1 in the database. If it exists, no action is taken and the loop continues; otherwise I should add the data to table1.

So, in every iteration, I query the database, which I believe is time-consuming.

Is there a best practice for tasks like this?


Solution

How do you verify the existence of a record in your database table? Most likely you match it against a local Id or similar key.

If this is true, then I'd query the table once, select all Ids, and store them in a Hashtable (a Dictionary in .NET). (This might not be practical if your database contains millions of records.) Determining whether a record exists in the table is then a simple check for the key in the Dictionary, which is an O(1) operation on average and so a lot cheaper than one expensive database roundtrip per record.
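A minimal sketch of that pattern, using Python's sqlite3 as a stand-in for the actual database driver (the table name `table1` and its `id`/`value` columns are assumptions, since the question doesn't give the schema):

```python
import sqlite3

# Hypothetical schema standing in for the real table1.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (id INTEGER PRIMARY KEY, value TEXT)")
conn.executemany("INSERT INTO table1 VALUES (?, ?)", [(1, "a"), (2, "b")])

# One roundtrip: load all existing Ids into a set (O(1) average lookup).
existing = {row[0] for row in conn.execute("SELECT id FROM table1")}

incoming = [(2, "b"), (3, "c"), (4, "d")]
# The loop now checks the in-memory set, not the database.
to_add = [rec for rec in incoming if rec[0] not in existing]

# One batch insert for everything that was missing.
conn.executemany("INSERT INTO table1 VALUES (?, ?)", to_add)
conn.commit()
```

The same shape applies in .NET: `SELECT id FROM table1` into a `Dictionary`/`HashSet`, filter locally, then insert the remainder in one batch.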

The next thing to think about is how to remember the records you need to add to the table. This depends on whether your local data may itself contain duplicates that you need to check for, or is guaranteed to be free of (local) duplicates.

In the simple case where there are no possible duplicates, adding each new record to the Dictionary under its key and later enumerating Dictionary.Values to collect the pending inserts is probably as fast as it gets. If the inserts need to be really fast because they are massive, consider using SQL bulk inserts.

If your table is too large to cache the Ids locally, I'd consider implementing a stored procedure for the insert and putting the logic that decides whether to actually perform an insert, or do nothing, inside it. This gets rid of the second roundtrip, which is usually pretty expensive.
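The stored-procedure body boils down to a single conditional insert. A sketch of that logic, again with sqlite3 and the assumed `table1(id, value)` schema (in SQL Server the same statement would live inside the procedure):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (id INTEGER PRIMARY KEY, value TEXT)")
conn.execute("INSERT INTO table1 VALUES (1, 'a')")

def insert_if_missing(conn, rec_id, value):
    # One statement, one roundtrip: the existence check and the insert
    # both happen server-side, so there is no separate SELECT roundtrip.
    conn.execute(
        "INSERT INTO table1 (id, value) SELECT ?, ? "
        "WHERE NOT EXISTS (SELECT 1 FROM table1 WHERE id = ?)",
        (rec_id, value, rec_id),
    )

insert_if_missing(conn, 1, "a")  # already present: no-op
insert_if_missing(conn, 2, "b")  # absent: inserted
conn.commit()
```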

If your RDBMS implements the SQL MERGE command (assuming you're using MS SQL Server, it does), I'd insert all the data into a temporary table and then MERGE it into the target table. This is probably the fastest solution.
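A sketch of the staging-table-then-merge pattern. SQLite has no MERGE statement, so `INSERT OR IGNORE` against the primary key plays the "WHEN NOT MATCHED THEN INSERT" role here; on SQL Server you'd bulk-copy into the temp table and run an actual MERGE (the `staging` table name and schema are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (id INTEGER PRIMARY KEY, value TEXT)")
conn.execute("INSERT INTO table1 VALUES (1, 'a')")

# 1. Bulk-load all candidate rows into a temporary table.
conn.execute("CREATE TEMP TABLE staging (id INTEGER, value TEXT)")
conn.executemany("INSERT INTO staging VALUES (?, ?)",
                 [(1, "a"), (2, "b"), (3, "c")])

# 2. Merge: rows whose id already exists in table1 are silently skipped,
#    everything else is inserted in one set-based statement.
conn.execute("INSERT OR IGNORE INTO table1 SELECT id, value FROM staging")
conn.commit()
```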

OTHER TIPS

How much data you have, and which SQL implementation you use, can make a big difference here...

For example, with 10 million rows of data, performing 10 million (potentially logged) operations, one for each row, will take orders of magnitude longer than, for example:

  • uploading the same data to a temporary table in a bulk operation, e.g. through the bulk-copy API if you're using SQL Server;
  • performing a left outer join to diff the data;
  • inserting the difference in a single batch operation.
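The three steps above can be sketched as follows, using sqlite3 in place of a real bulk-copy API (the `table1` and `staging` names and schemas are assumptions carried over from the question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (id INTEGER PRIMARY KEY, value TEXT)")
conn.execute("INSERT INTO table1 VALUES (1, 'a')")

# Step 1: bulk upload into a temporary table (stand-in for a bulk-copy API).
conn.execute("CREATE TEMP TABLE staging (id INTEGER, value TEXT)")
conn.executemany("INSERT INTO staging VALUES (?, ?)",
                 [(1, "a"), (2, "b"), (3, "c")])

# Steps 2 and 3: left outer join to diff, then insert the difference
# (rows with no match in table1) in a single batch statement.
conn.execute(
    "INSERT INTO table1 (id, value) "
    "SELECT s.id, s.value FROM staging s "
    "LEFT OUTER JOIN table1 t ON t.id = s.id "
    "WHERE t.id IS NULL"
)
conn.commit()
```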
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow