I'm attempting to design an Ab Initio load process without any Ab Initio training or documentation. Yeah, I know. A design decision is: the incoming data files will contain both inserts and updates. Should I have the feed provider split them into two data files (1-10 GB in size nightly) and have Ab Initio do the inserts and updates separately?

A problem I see with that is that data isn't always what you expect it to be... an insert row may already be present (perhaps a purge failed or the feed provider made a mistake), or an update row may not be present.

So I'm wondering if I should just combine all inserts and updates and use the Oracle MERGE statement (after parallel-loading the data into a staging table with no indexes, of course).

But I don't know whether Ab Initio supports MERGE or not.

There isn't much in the way of Ab Initio tutorials or docs on the web... can you direct me to anything good?


Solution 2

I would certainly not rely on a source system to tell me whether rows are present in the target table or not. My instinct says to go for a parallel, nologging (if possible), compressed (if possible) load into a staging table, followed by a merge. If Ab Initio does not support MERGE, then hopefully it supports a call to a PL/SQL procedure, or direct execution of a SQL statement.
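As a minimal sketch of the "wrap it in a PL/SQL procedure" option, assuming a hypothetical target table CUSTOMER and staging table STG_CUSTOMER joined on CUSTOMER_ID (none of these names come from the original post):

-- Hypothetical procedure wrapping the staging-to-target merge.
CREATE OR REPLACE PROCEDURE merge_stg_customer AS
BEGIN
  MERGE INTO customer tgt
  USING stg_customer src
     ON (tgt.customer_id = src.customer_id)
  WHEN MATCHED THEN
    UPDATE SET tgt.name       = src.name,
               tgt.updated_on = src.updated_on
  WHEN NOT MATCHED THEN
    INSERT (customer_id, name, updated_on)
    VALUES (src.customer_id, src.name, src.updated_on);
END;
/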

If this is a large amount of data I'd like to arrange hash partitioning on the join key for the new and current data sets too.
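A rough sketch of such a staging table (again, the table name and column list are placeholders): it could be created nologging, compressed, and hash-partitioned on the join key used by the merge.

-- Hypothetical staging table: nologging, compressed,
-- hash-partitioned on the merge key.
CREATE TABLE stg_customer (
  customer_id NUMBER,
  name        VARCHAR2(100),
  updated_on  DATE
)
NOLOGGING
COMPRESS
PARTITION BY HASH (customer_id) PARTITIONS 16;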

Other tips

The solution you depicted (loading the inserts and updates into a staging table and then merging the content into the main table) is feasible.

A design decision is: for the incoming data files there will be inserts and updates.

I don't know the background of this decision, but you should know that this solution will result in a longer execution time. In order to execute inserts and updates you have to use the "Update Table" component, which is slower than the simpler "Output Table" component. By the way, don't use the same "Update Table" component for inserts and updates simultaneously; use a separate "Update Table" for inserts and another one for updates instead (you'll see a dramatic performance boost this way). (If you can change the above-mentioned design decision, then use an "Output Table" instead.)

In either case, set the "Update Table"/"Output Table" components to "never abort" so that your graph won't fail if the same insert occurs twice or if there is no row to update.

Finally, the Oracle MERGE statement should be fired/executed from a "Run SQL" component once the processing of all the inserts and updates has finished. Use phases to make sure it happens in that order...
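For illustration, the SQL fired from that final phase could simply be the merge itself, or a call to a wrapper procedure like the hypothetical merge_stg_customer sketched above:

-- Hypothetical statement executed by the "Run SQL" component
-- after all staging-load phases have completed.
BEGIN
  merge_stg_customer;
  COMMIT;
END;
/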

If you intend to build a graph with parallel execution, then make sure that the insert and update records for the same entries are processed by the same partitions. (Use the primary key of the final table as the key in the "Partition by Key" component.)

If you want an overview of how many duplicate inserts or invalid updates occur in your messy input, then use the "Reject" (and possibly "Error") ports of the appropriate "Update Table"/"Output Table" components for further processing.
