Question

I have about 1.5 million files I need to insert records for in the database. Each record is inserted with a key that includes the name of the file.

The catch: The files are not uniquely identified currently.

So, what we'd like to do is, for each file:

  • Insert a record. One of the fields in the record should include an amazon S3 key which should include the ID of the newly inserted record.
  • Rename the file to include the ID so that it matches the format of the key.

The best thing I can think to do is:

  • Run an individual insert command that returns the ID of the added row.
  • Add that back as a property to the individual business object I'm looping through.
  • Generate an update statement that updates the S3 key to include the ID
  • Output the file, concatenating the ID onto the end of the file name.
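A minimal sketch of that per-row loop, assuming a hypothetical `Files` table with an IDENTITY `Id` column and illustrative object/column names (none of these come from the question itself):

```csharp
using var conn = new SqlConnection(connectionString);
conn.Open();
foreach (var file in files)
{
    // Insert the row and read the generated ID back in one round trip.
    using var insert = new SqlCommand(
        "INSERT INTO Files (OriginalName) OUTPUT INSERTED.Id VALUES (@name)", conn);
    insert.Parameters.AddWithValue("@name", file.OriginalName);
    file.Id = (int)insert.ExecuteScalar();

    // Second statement per row: set the S3 key now that the ID is known.
    file.S3Key = $"uploads/{file.Id}-{file.OriginalName}";
    using var update = new SqlCommand(
        "UPDATE Files SET S3Key = @key WHERE Id = @id", conn);
    update.Parameters.AddWithValue("@key", file.S3Key);
    update.Parameters.AddWithValue("@id", file.Id);
    update.ExecuteNonQuery();

    // Rename the file on disk so it matches the key format.
    File.Move(file.Path, Path.Combine(outputDir, $"{file.Id}-{file.OriginalName}"));
}
```

This is exactly the pattern the cost breakdown below is counting: two statements and one file operation per record, times 1.5 million.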

As far as I can tell, that amounts to:

  • 1.5 million insert statements
    • individual SqlCommand executions and reads (because we need the ID back),
  • 1.5 million times setting a property on an object.
  • 1.5 million update statements generated and executed
    • Perhaps this could be one giant concatenated update statement that does them all at once; not sure if that helps
  • 1.5 million file copies.

I can't get around the actual file part, but for the rest, is there a better strategy I'm not seeing?


Solution

If you make the client application generate the IDs, you can use a straightforward SqlBulkCopy to insert all rows at once. It will be done in seconds.
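A sketch of that approach, with client-assigned IDs starting from a pre-reserved value (table and column names are illustrative, not from the question):

```csharp
// Assign IDs locally, so S3 keys and new file names can be computed
// before any database work happens at all.
int nextId = reservedStartId; // hypothetical: the first ID in a reserved range
var table = new DataTable();
table.Columns.Add("Id", typeof(int));
table.Columns.Add("S3Key", typeof(string));
table.Columns.Add("OriginalName", typeof(string));
foreach (var file in files)
{
    file.Id = nextId++;
    file.S3Key = $"uploads/{file.Id}-{file.OriginalName}";
    table.Rows.Add(file.Id, file.S3Key, file.OriginalName);
}

// One streaming bulk load instead of 3 million individual statements.
// KeepIdentity makes SQL Server honor our IDs if Id is an IDENTITY column.
using var bulk = new SqlBulkCopy(connectionString, SqlBulkCopyOptions.KeepIdentity)
{
    DestinationTableName = "Files",
    BatchSize = 100_000
};
bulk.WriteToServer(table);
```

Because the IDs exist before the insert, the separate UPDATE pass disappears entirely: the S3 key is written in the same row as everything else.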

If you want to keep the IDENTITY property of the column, you can run DBCC CHECKIDENT with RESEED to advance the identity counter by 1.5M, giving you a guaranteed gap that you can insert into. If the number of rows is not known ahead of time, you can perform the inserts in smaller chunks of, say, 100k until you are done.
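A sketch of reserving such a gap, assuming an illustrative `Files` table name and run once before the bulk load:

```csharp
using var conn = new SqlConnection(connectionString);
conn.Open();

// Read the current identity value, then push the counter forward
// to reserve a contiguous block for client-assigned IDs.
long current = Convert.ToInt64(new SqlCommand(
    "SELECT IDENT_CURRENT('Files')", conn).ExecuteScalar());
long reserveUpTo = current + 1_500_000;

new SqlCommand($"DBCC CHECKIDENT('Files', RESEED, {reserveUpTo})", conn)
    .ExecuteNonQuery();

// IDs (current + 1) .. reserveUpTo are now a guaranteed gap:
// normal inserts from other sessions will start after reserveUpTo.
```

Note that DBCC CHECKIDENT requires elevated permissions (ownership of the table or ALTER on it), so this is typically a one-off administrative step rather than application code.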

Other tips

You will cut the number of SQL statements in half by not relying on the database to generate an ID for each row. Do everything locally (including the assignment of IDs) and then do a single batch of inserts at the end, with IDENTITY_INSERT ON.

This will cause SQL Server to use your IDs for this batch of records.
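A sketch of one such batch, assuming an open SqlConnection `conn` and illustrative names; real code should use parameters or a table-valued parameter rather than string concatenation:

```csharp
// Build one multi-row INSERT with IDENTITY_INSERT on, so SQL Server
// accepts the locally assigned Id values instead of generating its own.
var sql = new StringBuilder();
sql.AppendLine("SET IDENTITY_INSERT Files ON;");
sql.AppendLine("INSERT INTO Files (Id, S3Key, OriginalName) VALUES");
sql.AppendLine(string.Join(",\n", batch.Select(f =>
    $"({f.Id}, '{f.S3Key}', '{f.OriginalName}')")));
sql.AppendLine(";");
sql.AppendLine("SET IDENTITY_INSERT Files OFF;");

new SqlCommand(sql.ToString(), conn).ExecuteNonQuery();
```

SQL Server caps a single INSERT ... VALUES statement at 1,000 row value expressions, so 1.5 million rows would need to be chunked into at most 1,000-row batches.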

If this is still too slow (and 1.5 million inserts might be), the next step would be to output your data to a text file (XML, comma delimited, or whatever) and then do a bulk import operation on the file.
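A sketch of that file-based route, again with illustrative paths and names, assuming an open SqlConnection `conn` and a path the SQL Server service account can read:

```csharp
// Export the locally prepared rows to a delimited file...
File.WriteAllLines(@"C:\temp\files.csv",
    files.Select(f => $"{f.Id}|{f.S3Key}|{f.OriginalName}"));

// ...then let SQL Server ingest it in a single minimally logged operation.
// KEEPIDENTITY preserves the Id values from the file.
new SqlCommand(@"
    BULK INSERT Files
    FROM 'C:\temp\files.csv'
    WITH (FIELDTERMINATOR = '|', ROWTERMINATOR = '\n', KEEPIDENTITY)", conn)
    .ExecuteNonQuery();
```

The bcp command-line utility is an equivalent alternative when the file lives on the client machine rather than on the server.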

That's as fast as you will be able to make it, I think.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow