How to change slow parametrized inserts into fast bulk copy (even from memory)

StackOverflow https://stackoverflow.com/questions/127152

Problem

I had something like this in my code (.NET 2.0, MS SQL):

SqlConnection connection = new SqlConnection(@"Data Source=localhost;Initial Catalog=DataBase;Integrated Security=True");
  connection.Open();

  SqlCommand cmdInsert = connection.CreateCommand();
  SqlTransaction sqlTran = connection.BeginTransaction();
  cmdInsert.Transaction = sqlTran;

  cmdInsert.CommandText =
     @"INSERT INTO MyDestinationTable" +
      "(Year, Month, Day, Hour,  ...) " +
      "VALUES " +
      "(@Year, @Month, @Day, @Hour, ...) ";

  cmdInsert.Parameters.Add("@Year", SqlDbType.SmallInt);
  cmdInsert.Parameters.Add("@Month", SqlDbType.TinyInt);
  cmdInsert.Parameters.Add("@Day", SqlDbType.TinyInt);
  // more fields here
  cmdInsert.Prepare();

  Stream stream = new FileStream(fileName, FileMode.Open, FileAccess.Read);

  StreamReader reader = new StreamReader(stream);
  char[] delimeter = new char[] {' '};
  String[] records;
  while (!reader.EndOfStream)
  {
    records = reader.ReadLine().Split(delimeter, StringSplitOptions.None);

    cmdInsert.Parameters["@Year"].Value = Int32.Parse(records[0].Substring(0, 4));
    cmdInsert.Parameters["@Month"].Value = Int32.Parse(records[0].Substring(5, 2));
    cmdInsert.Parameters["@Day"].Value = Int32.Parse(records[0].Substring(8, 2));
    // more here complicated stuff here
    cmdInsert.ExecuteNonQuery();
  }
  sqlTran.Commit();
  connection.Close();

With cmdInsert.ExecuteNonQuery() commented out this code executes in less than 2 seconds. With the SQL execution it takes 1 minute 20 seconds. There are around 0.5 million records. The table is emptied beforehand. An SSIS data flow task of similar functionality takes around 20 seconds.

  • Bulk insert was not an option (see below). I did some fancy stuff during this import.
  • My test machine is a Core 2 Duo with 2 GB of RAM.
  • When looking in Task Manager the CPU was not fully utilized. IO also did not seem to be fully utilized.
  • The schema is simple as hell: one table with an autoincrement column as the primary index and about 10 ints, smallints and chars(10).

After some answers here I found that it is possible to execute a bulk copy from memory! I was refusing to use bulk copy because I thought it had to be done from a file...

Now I use this and it takes around 20 seconds (like the SSIS task):

  DataTable dataTable = new DataTable();

  dataTable.Columns.Add(new DataColumn("ixMyIndex", System.Type.GetType("System.Int32")));   
  dataTable.Columns.Add(new DataColumn("Year", System.Type.GetType("System.Int32")));   
  dataTable.Columns.Add(new DataColumn("Month", System.Type.GetType("System.Int32")));
  dataTable.Columns.Add(new DataColumn("Day", System.Type.GetType("System.Int32")));
 // ... and more to go

  DataRow dataRow;
  object[] objectRow = new object[dataTable.Columns.Count];

  Stream stream = new FileStream(fileName, FileMode.Open, FileAccess.Read);

  StreamReader reader = new StreamReader(stream);
  char[] delimeter = new char[] { ' ' };
  String[] records;
  int recordCount = 0;
  while (!reader.EndOfStream)
  {
    records = reader.ReadLine().Split(delimeter, StringSplitOptions.None);

    dataRow = dataTable.NewRow();
    objectRow[0] = null; 
    objectRow[1] = Int32.Parse(records[0].Substring(0, 4));
    objectRow[2] = Int32.Parse(records[0].Substring(5, 2));
    objectRow[3] = Int32.Parse(records[0].Substring(8, 2));
    // my fancy stuf goes here

    dataRow.ItemArray = objectRow;         
    dataTable.Rows.Add(dataRow);

    recordCount++;
  }

  SqlBulkCopy bulkTask = new SqlBulkCopy(connection, SqlBulkCopyOptions.TableLock, null);
  bulkTask.DestinationTableName = "MyDestinationTable"; 
  bulkTask.BatchSize = dataTable.Rows.Count;
  bulkTask.WriteToServer(dataTable);
  bulkTask.Close();

Solution

Instead of inserting each record individually, use the SqlBulkCopy class to bulk insert all the records at once.

Create a DataTable, add all your records to it, and then use SqlBulkCopy.WriteToServer to bulk insert all the data at once.
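A minimal sketch of that pattern, with an explicit column mapping added (the table and column names are taken from the question's code; the mapping calls are only needed when the DataTable layout does not match the destination table, and "connection" is assumed to be an open SqlConnection):

  // Buffer rows in a DataTable, then push them to the server in one bulk copy.
  DataTable table = new DataTable();
  table.Columns.Add("Year", typeof(int));
  table.Columns.Add("Month", typeof(int));
  table.Columns.Add("Day", typeof(int));
  // ... add the remaining columns here

  // Fill the table (parsing logic omitted); one illustrative row:
  table.Rows.Add(2008, 9, 24);

  using (SqlBulkCopy bulk = new SqlBulkCopy(connection, SqlBulkCopyOptions.TableLock, null))
  {
      bulk.DestinationTableName = "MyDestinationTable";
      // Explicit mappings are optional when the DataTable columns already
      // line up with the destination table.
      bulk.ColumnMappings.Add("Year", "Year");
      bulk.ColumnMappings.Add("Month", "Month");
      bulk.ColumnMappings.Add("Day", "Day");
      bulk.WriteToServer(table);
  }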

Other tips

Do you require the transaction? Using a transaction needs much more resources than simple commands.

Also, if you are sure that the inserted values are correct, you can use a BULK INSERT.

1 minute sounds pretty reasonable for 0.5 million records. That's a record every 0.00012 seconds.

Does the table have any indexes? Removing them and reapplying them after the bulk insert would improve the performance of the inserts, if that is an option.
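A sketch of that idea, assuming a hypothetical nonclustered index named IX_MyDestinationTable_Year (the index name and column are illustrative, not from the question, and "connection" is an open SqlConnection):

  // Drop a nonclustered index before the load and rebuild it afterwards.
  SqlCommand dropIndex = connection.CreateCommand();
  dropIndex.CommandText = "DROP INDEX IX_MyDestinationTable_Year ON MyDestinationTable";
  dropIndex.ExecuteNonQuery();

  // ... perform the bulk insert here ...

  SqlCommand createIndex = connection.CreateCommand();
  createIndex.CommandText =
      "CREATE NONCLUSTERED INDEX IX_MyDestinationTable_Year ON MyDestinationTable (Year)";
  createIndex.ExecuteNonQuery();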

It doesn't seem unreasonable to me to process 8,333 records per second...what kind of throughput are you expecting?

If you need better speed, you might consider implementing bulk insert:

http://msdn.microsoft.com/en-us/library/ms188365.aspx
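For reference, a hedged sketch of issuing a T-SQL BULK INSERT from the same connection; the file path and terminators are placeholders, and the file must be readable by the SQL Server instance itself, not just by the client:

  // Illustrative only: BULK INSERT reads a file visible to the SQL Server
  // service, so the path below is a server-side placeholder.
  SqlCommand bulkInsert = connection.CreateCommand();
  bulkInsert.CommandText =
      @"BULK INSERT MyDestinationTable " +
      @"FROM 'C:\data\import.txt' " +
      @"WITH (FIELDTERMINATOR = ' ', ROWTERMINATOR = '\n', TABLOCK)";
  bulkInsert.ExecuteNonQuery();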

If some form of bulk insert isn't an option, the other way would be multiple threads, each with their own connection to the database.

The issue with the current system is that you have 500,000 round trips to the database, and are waiting for the first round trip to complete before starting the next - any sort of latency (ie, a network between the machines) will mean that most of your time is spent waiting.

If you can split the job up, perhaps using some form of producer/consumer setup, you might find that you can get much more utilisation of all the resources.

However, to do this you will have to lose the one great transaction - otherwise the first writer thread will block all the others until its transaction is completed. You can still use transactions, but you'll have to use a lot of small ones rather than 1 large one.
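A rough sketch of that producer/consumer split using .NET 2.0 primitives (Thread and a lock-protected Queue). The batch size, column list and connectionString are placeholders, and each worker uses its own connection with small per-batch transactions instead of one large one:

  // Rough sketch only: one producer parses the file into batches, several
  // consumers each use their own connection and insert a batch inside a
  // small transaction.
  Queue<object[][]> batches = new Queue<object[][]>();
  bool producerDone = false;
  object sync = new object();

  ThreadStart consumerWork = delegate
  {
      using (SqlConnection conn = new SqlConnection(connectionString))
      {
          conn.Open();
          while (true)
          {
              object[][] batch = null;
              lock (sync)
              {
                  if (batches.Count > 0)
                      batch = batches.Dequeue();
                  else if (producerDone)
                      break;
              }
              if (batch == null) { Thread.Sleep(10); continue; }

              using (SqlTransaction tran = conn.BeginTransaction())
              using (SqlCommand cmd = conn.CreateCommand())
              {
                  cmd.Transaction = tran;
                  cmd.CommandText = "INSERT INTO MyDestinationTable (Year, Month, Day) " +
                                    "VALUES (@Year, @Month, @Day)";
                  cmd.Parameters.Add("@Year", SqlDbType.SmallInt);
                  cmd.Parameters.Add("@Month", SqlDbType.TinyInt);
                  cmd.Parameters.Add("@Day", SqlDbType.TinyInt);

                  foreach (object[] row in batch)
                  {
                      cmd.Parameters["@Year"].Value = row[0];
                      cmd.Parameters["@Month"].Value = row[1];
                      cmd.Parameters["@Day"].Value = row[2];
                      cmd.ExecuteNonQuery();
                  }
                  tran.Commit();
              }
          }
      }
  };

  Thread[] consumers = new Thread[4];
  for (int i = 0; i < consumers.Length; i++)
  {
      consumers[i] = new Thread(consumerWork);
      consumers[i].Start();
  }

  // Producer: parse the file, enqueue batches of e.g. 10,000 rows under lock (sync),
  // then set producerDone = true inside the lock and Join the consumers.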

The SSIS will be fast because it's using the bulk-insert method - do all the complicated processing first, generate the final list of data to insert and give it all at the same time to bulk-insert.

I assume that what is taking the approximately 58 seconds is the physical inserting of 500,000 records - so you are getting around 10,000 inserts a second. Without knowing the specs of your database server machine (I see you are using localhost, so network delays shouldn't be an issue), it is hard to say if this is good, bad, or abysmal.

I would look at your database schema - are there a bunch of indices on the table that have to be updated after each insert? This could be from other tables with foreign keys referencing the table you are working on. There are SQL profiling tools and performance monitoring facilities built into SQL Server, but I've never used them. But they may show up problems like locks, and things like that.

Do the fancy stuff on the data, on all records first. Then Bulk-Insert them.

(Since you're not doing selects after an insert, I don't see a problem with applying all operations on the data before the bulk insert.)

If I had to guess, the first thing I would look for are too many or the wrong kind of indexes on the tbTrafficLogTTL table. Without looking at the schema definition for the table, I can't really say, but I have experienced similar performance problems when:

  1. The primary key is a GUID and the primary index is CLUSTERED.
  2. There's some sort of UNIQUE index on a set of fields.
  3. There are too many indexes on the table.

When you start indexing half a million rows of data, the time spent to create and maintain indexes adds up.

I will also note that if you have any option to convert the Year, Month, Day, Hour, Minute, Second fields into a single datetime2 or timestamp field, you should. You're adding a lot of complexity to your data architecture, for no gain. The only reason I would even contemplate using a split-field structure like that is if you're dealing with a pre-existing database schema that cannot be changed for any reason. In which case, it sucks to be you.
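If the schema can be changed, the parsing side collapses to a single column. A small sketch of the idea against the question's code (the column name "Occurred" and the objectRow slot are illustrative assumptions):

  // Hypothetical: one DateTime column replaces the split Year/Month/Day/Hour/... fields.
  dataTable.Columns.Add(new DataColumn("Occurred", typeof(DateTime)));
  // ...
  int year  = Int32.Parse(records[0].Substring(0, 4));
  int month = Int32.Parse(records[0].Substring(5, 2));
  int day   = Int32.Parse(records[0].Substring(8, 2));
  objectRow[1] = new DateTime(year, month, day);   // extend with hour/minute/second as needed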

I had a similar problem in my last contract. You're making 500,000 trips to SQL to insert your data. For a dramatic increase in performance, you want to investigate the BulkInsert method in the SQL namespace. I had "reload" processes that went from 2+ hours to restore a couple of dozen tables down to 31 seconds once I implemented Bulk Import.

This could best be accomplished using something like the bcp command. If that isn't available, the suggestions above about using BULK INSERT are your best bet. You're making 500,000 round trips to the database and writing 500,000 entries to the log files, not to mention any space that needs to be allocated to the log file, the table, and the indexes.

If you're inserting in an order that is different from your clustered index, you also have to deal with the time require to reorganize the physical data on disk. There are a lot of variables here that could possibly be making your query run slower than you would like it to.

~10,000 transactions per second isn't terrible for individual inserts round-tripping from code.

BULK INSERT = bcp from a permissions point of view.

You could batch the INSERTs to reduce round trips: SqlDataAdapter.UpdateBatchSize = 10000 gives 50 round trips.

You still have 500k inserts though...

Article

MSDN
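A rough sketch of that batching approach with SqlDataAdapter; the batch size and column list are illustrative, and UpdatedRowSource must be None for batched execution to work:

  // Batch the parameterized INSERTs through a SqlDataAdapter so 500k rows
  // go over in ~50 round trips instead of 500k.
  SqlDataAdapter adapter = new SqlDataAdapter();
  SqlCommand insert = connection.CreateCommand();
  insert.CommandText =
      "INSERT INTO MyDestinationTable (Year, Month, Day) VALUES (@Year, @Month, @Day)";
  insert.Parameters.Add("@Year", SqlDbType.SmallInt, 0, "Year");
  insert.Parameters.Add("@Month", SqlDbType.TinyInt, 0, "Month");
  insert.Parameters.Add("@Day", SqlDbType.TinyInt, 0, "Day");
  insert.UpdatedRowSource = UpdateRowSource.None;   // required for batching

  adapter.InsertCommand = insert;
  adapter.UpdateBatchSize = 10000;                  // rows sent per round trip

  // dataTable holds the parsed rows (all in Added state), as in the question.
  adapter.Update(dataTable);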

License: CC-BY-SA with attribution