Question

I'm building a system for updating large amounts of data through various CSV feeds. Normally I would just loop though each row in the feed, do a select query to check if the item already exists and insert/update an item depending if it exists or not.

I feel this method isn't very scalable and could hammer the server on larger feeds. My solution is to loop through the items as normal but store them in memory. Then for every 100 or so items do a select on those 100 items and get a list of existing items in the database that match. Then concatenate the insert/update statements together and run them into the database. This would essentially cut down on the trips to the database.

Is this a scalable enough solution and are there any example tutorials on importing large feeds into a productive environment?

Thanks

Was it helpful?

Solution

Seeing that you're using SQL Server 2008, I would recommend this approach:

  • first bulkcopy your CSV files into a staging table
  • update your target table from that staging table using the MERGE command

Check out the MSDN docs and a great blog post on how to use the MERGE command.

Basically, you create a link between your actual data table and the staging table on a common criteria (e.g. a common primary key), and then you can define what to do when

  • the rows match, e.g. the row exists in both the source and the target table --> typically you'd either update some fields, or just ignore it all together
  • the row from the source doesn't exist in the target --> typically a case for an INSERT

You would have a MERGE statement something like this:

MERGE TargetTable AS t
USING SourceTable AS src
ON t.PrimaryKey = src.PrimaryKey

WHEN NOT MATCHED THEN
  INSERT (list OF fields)
  VALUES (list OF values)

WHEN MATCHED THEN
  UPDATE
    SET (list OF SET statements)
;

Of course, the ON clause can be much more involved if needed. And of course, your WHEN statements can also be more complex, e.g.

WHEN MATCHED AND (some other condition) THEN ......

and so forth.

MERGE is a very powerful and very useful new command in SQL Server 2008 - use it, if you can!

OTHER TIPS

Your way is the worst possible solution. In general, you should not think in terms of looping through records individually. We used to have a company built import tool that loops through records, it would take 18-20 hours to load a file with over a million records (something that wasn't a frequent occurrence when it was built but which is a many times a day occurrence now).

I see two options: First use bulk insert to load to a staging table and do whatever clean up you need to do on that table. How are you determining if the record already exists? You should be able to build a set-based update by joining to the staging table on those fields which determine the update. Often I have a added a column to my staging table for the id of the record it matches to and populated that through a query then done the update. Then you do an insert of the records which don't have a corresponding id. If you have too many records to do all at once, you may want to run in batches (which yes is a loop), but make the batches considerably larger than 1 record at a time (I usually start with 2000 and then based on the time it takes for that determine if I can do more or less in the batch).

I think 2008 also has a merge statement but I have not yet had a chance to use it. Look it up in books online.

The alternative is to use SSIS which is optimized for speed. SSIS is a complex thing though and the learning curve is steep.

One way is load your CSV into a DataTable (or more likely a DataReader) and then batch slam in the results using SqlBulkCopy -

http://msdn.microsoft.com/en-us/library/system.data.sqlclient.sqlbulkcopy.aspx

Its pretty efficient and you can do some column mapping. Tip - when you map columns using SqlBulkCopy they are case sensitive.

Another approach would be to write a .Net stored procedure on server on the server to operate on the entire file...

Only if you need more control than Kris Krause's solution though - I'm a big fan of keeping it simple (and reusable) where we can...

Do you need to be rolling your own here at all? Would it be possible to provide the data in such a way that the SQL Server can use Bulk Import to load it in and then deal with duplicates in the database once the import is complete?

When it comes to heavy lifting with a lot of data my experience tends to be that working in the database as much as possible is much quicker and less resource intensive.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top