Question

In my POC, I am trying to implement an ETL data flow (star schema) using a Pig script. As you know, the dimensions have to be loaded before the fact table. For a dimension, I need to load only the new records from the source (a CSV file), that is, records that are not already in the dimension (SQL Server). All the joins in Pig (skewed, replicated, and merge joins) match against the existing records and output only the matched records. How can I get the unmatched records as output so that I can load them into my dimension?

Thanks,
Selvam

Solution

Do a left outer join of the source (CSV file) with the dimension (SQL Server) table. A record in the result whose dimension-side join column is null had no match in the dimension, so it is a new record. Filter the result to keep only the records where that column is null.
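
A minimal Pig Latin sketch of this approach follows. The paths, relation names, and schemas (`source.csv`, `dim_customer_keys`, `cust_id`, `cust_name`) are hypothetical and not from the question. It also assumes the dimension's keys have already been exported from SQL Server to HDFS as a comma-delimited file, for example with Sqoop.

```pig
-- Hypothetical source file and schema; adjust to your data.
src = LOAD 'source.csv' USING PigStorage(',')
      AS (cust_id:chararray, cust_name:chararray);

-- Hypothetical export of the existing dimension keys from SQL Server.
dim = LOAD 'dim_customer_keys' USING PigStorage(',')
      AS (cust_id:chararray);

-- Left outer join: every source row is kept; the dim fields are null
-- for source rows that have no match in the dimension.
joined = JOIN src BY cust_id LEFT OUTER, dim BY cust_id;

-- Keep only the unmatched rows, i.e. the new records.
new_rows = FILTER joined BY dim::cust_id IS NULL;

-- Drop the dim-side column and restore the plain field names.
new_dim = FOREACH new_rows GENERATE src::cust_id AS cust_id,
                                    src::cust_name AS cust_name;

STORE new_dim INTO 'new_dimension_rows' USING PigStorage(',');
```

Note that source rows whose `cust_id` is itself null never match anything in the join, so they also come through as "new"; filter them out beforehand if that is not what you want.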

Licensed under: CC-BY-SA with attribution