Problem

I'm trying to load a large dataset from SQL Server 2008 in SSIS. However, it's too slow for Visual Studio to load everything at once, so I decided to use a Foreach Loop to load just part of the table on each pass.

For example, if there are 10 million records, I wish I could load 1 million at a time and run the loop 10 times to complete the processing.

This is just a design in my head, and I have no idea how to build it with the Foreach Loop container. Is there any other approach for dealing with a large dataset?


Solution

The best way, in my opinion, is to partition your data functionally. In most cases a date column is a good candidate for this; let's take an order date as an example.

For that column, find the granularity that works best, for example, each year of order dates yields roughly a million rows.

Instead of a Foreach Loop container, use a For Loop container.

To make this loop work, you'll have to find the minimum and maximum year of all order dates in your source data. These can be retrieved with SQL statements that save their scalar results into SSIS variables, for example:
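A minimal sketch of such a query, assuming the transactions table and OrderDate column used in this answer; run it from an Execute SQL Task with ResultSet set to "Single row" and map the two columns to Int32 package variables (e.g. User::MinYear and User::MaxYear):

SELECT MIN(YEAR(OrderDate)) AS MinYear,
       MAX(YEAR(OrderDate)) AS MaxYear
FROM transactions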

Next, set up your For Loop container to loop from the minimum to the maximum year that you stored in variables earlier, adding one year per iteration.
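Assuming the loop counter is an Int32 variable named User::ForLoopCurrentYear (the same name used in the expression below), the For Loop container's properties could look roughly like this:

InitExpression:   @[User::ForLoopCurrentYear] = @[User::MinYear]
EvalExpression:   @[User::ForLoopCurrentYear] <= @[User::MaxYear]
AssignExpression: @[User::ForLoopCurrentYear] = @[User::ForLoopCurrentYear] + 1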

Lastly, to actually retrieve your data, you'll have to save your source SQL statement as an expression on a variable, with a WHERE clause that uses the current year produced by your For Loop container:

"SELECT * FROM transactions WHERE YEAR(OrderDate) = " + @[User::ForLoopCurrentYear]

Now you can use this variable in a data flow source (for example, an OLE DB Source with the data access mode "SQL command from variable") to retrieve your partitioned data.

Edit:

A different solution, using a Foreach Loop container, would be to retrieve your partition keys with an Execute SQL Task and save that result set in an SSIS variable of type Object:

SELECT YEAR(OrderDate) FROM transactions GROUP BY YEAR(OrderDate)

With a Foreach Loop container you can then loop through that object using the Foreach ADO enumerator and use the same method as above to inject the current year into your source SQL statement. A rough outline of that configuration follows.
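Assuming the Object variable holding the result set is named User::YearList (a name chosen here for illustration), the Foreach Loop container could be set up roughly as follows:

Collection:        Foreach ADO Enumerator, ADO object source variable = User::YearList,
                   Enumeration mode = Rows in the first table
Variable Mappings: User::ForLoopCurrentYear mapped to Index 0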

Other tips

So many variables to cover and I have a meeting in 5 minutes.

You say it's slow. What is slow? Without knowing that, you could be spending forever chasing the wrong rabbit.

SSIS took the crown in 2008 for ETL processing speed by loading 1 TB in 30 minutes. Sure, they tweaked the ever-loving bejesus out of the system to get it to do so, but they lay out in detail what steps they took.

10M rows, while it sounds large, is nothing I'd consider taxing for SSIS. To start, look at your destination object (assuming OLE DB). If it doesn't have the Fast Load option checked, you are issuing 10M single insert statements, which is going to swamp your transaction log. Also look at the number of rows in your commit size. 0 means all or nothing, which may or may not be the right decision based on your recoverability requirements, but do realize the implication that holds for your transaction log (it's going to eat quite a bit of space).
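For reference, these settings live on the OLE DB Destination; a sketch of the values being discussed (the specific numbers are only illustrative):

Data access mode:            Table or view - fast load
Rows per batch:              10000
Maximum insert commit size:  100000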

What transformation(s) are you applying to the data in the pipeline? There are transforms that will kill your throughput (Sort, Aggregate, etc.).

Create a baseline package: all it does is read N rows of data from the source location and perform a row count. This is critical to understanding the best theoretical throughput you could expect given your hardware, for example:
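A sketch of such a baseline, assuming the same transactions table as above and an arbitrary sample size: use this as the source query and send the rows to nothing but a Row Count transformation.

SELECT TOP (1000000) * FROM transactions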

Running a package in Visual Studio/BIDS/SSDT is slower, sometimes by an order of magnitude, than the experience you will get from invocation through SQL Agent/dtexec, because the designer wraps the execution in a debugger.
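For a fair timing comparison, you could run the package from the command line instead (the path here is illustrative):

dtexec /F "C:\SSIS\LoadTransactions.dtsx"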

I'll amend this answer as I have time, but those are some initial thoughts. I'll post on using a Foreach Loop container to process discrete chunks of data after the meeting.
