Map multiple columns from multiple files which are slightly different

https://stackoverflow.com/questions/12754233

05-07-2021

Question

I am looking for a good, practical method of tackling metadata normalization between multiple files that have slightly different schemas, for a batch ETL job in Talend.

I have a few hundred historical reports (around 25K to 200K records each), with about 100 to 150 columns per Excel file. Most of the column names are the same across all the files (98% overlap); however, there are subtle, evil differences:

Different column orders
Different column names (sometimes abbreviated, sometimes not)
Different column counts
Different word separators in column names (sometimes spaces, sometimes dots, dashes or underscores)
etc.

Short of writing a specialized application or brute forcing all the files by manually correcting them, are there any good free tools or methods that would provide a diff and correction between file column names in an intelligent or semi-automated fashion?

Solution

You could use Talend Open Studio to achieve that. But I do see one caveat.

The official way

In order to make Talend understand your Excel files, you will first need to load their metadata. The caveat is that you will need to load all the metadata by hand (one file at a time). In the free version of Talend (Open Studio Data), there is no support for dynamic metadata.
Using components like tMap, you can then map your input metadata to your desired output metadata (which could be an Excel file, a database or something else). During this step you can shape your input data into your desired output (fixing, ignoring or transforming it, etc.).
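
To illustrate what that mapping step has to do for files with different column orders and counts, here is a minimal sketch in plain Python, outside Talend. It is not Talend code and does not use any Talend API; the column names, the sample rows and the `reshape_row` helper are all hypothetical.

```python
# Hypothetical target schema; a real one would list all ~100-150 columns.
TARGET_COLUMNS = ["customer_id", "order_date", "amount"]


def reshape_row(header, row, target=TARGET_COLUMNS, default=None):
    """Reorder one input row into the target column order.

    Columns missing from the input are filled with `default`;
    input columns that are not part of the target are dropped.
    """
    by_name = dict(zip(header, row))
    return [by_name.get(col, default) for col in target]


# Two made-up files: different column order, different column count.
file_a = (["order_date", "customer_id", "amount"], ["2012-10-05", "C1", "9.50"])
file_b = (["customer_id", "amount", "region"], ["C2", "3.00", "EU"])

rows = [reshape_row(header, row) for header, row in (file_a, file_b)]
```

Keying each row by column name rather than by position is what makes the column order irrelevant; the sketch assumes the names themselves already match the target schema.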

The unofficial way

There seems to be a user-contributed component that offers support for dynamic Excel metadata. I did not test it, but it is worth trying: http://www.talendforge.org/exchange/?eid=663&product=tos&action=view&nav=1,1,1

This may change, as components are released and updated frequently. My answer reflects the status as of version 5.3.1.

OTHER TIPS

I write this tentatively as an "answer" because I don't have a link to hand that demonstrates exactly how it can be done. However, Pentaho Data Integration provides a very good way to load files like this. In a first transformation you read the metadata of the file (by that I mean the column names), and you then use the "metadata injection" functionality to inject that metadata into the next transformation, which reads the file.
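
The two-pass idea (discover the column names first, then use them to drive the actual read) can be sketched in plain Python. This is only an illustration of the concept, not Pentaho's metadata injection API. It assumes the sheets have been exported to CSV, since the Python standard library does not read Excel files; `read_header`, `read_records` and the sample data are hypothetical.

```python
import csv
import io


def read_header(stream):
    """First pass: return only the column names (the first row)."""
    header = next(csv.reader(stream), [])
    stream.seek(0)  # rewind so the second pass starts from the top
    return header


def read_records(stream, header):
    """Second pass: read the data rows using the header found earlier."""
    reader = csv.reader(stream)
    next(reader, None)  # skip the header row itself
    for row in reader:
        yield dict(zip(header, row))


# Made-up sample standing in for one exported report.
sample = io.StringIO("Cust ID,Order.Date\nC1,2012-10-05\n")
header = read_header(sample)
records = list(read_records(sample, header))
```

Separating the passes means the header can be inspected, or remapped, before any data row is processed.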

Now, in the scenario where your column names are slightly different, you'll have to do some additional mapping. Perhaps you can store a lookup table somewhere that maps each "alias" column name to its real column name.
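
A minimal sketch of such an alias lookup in Python follows. The canonical names, the alias table and the helper names are hypothetical. The fuzzy fallback via the standard library's `difflib` is an addition of this sketch, not part of the suggestion above, and any fuzzy match should be reviewed by a person before it is trusted.

```python
import difflib
import re

# Hypothetical canonical schema and alias table.
CANONICAL = ["customer_id", "order_date", "amount"]
ALIASES = {"cust_id": "customer_id", "amt": "amount"}


def normalize(name):
    """Lowercase; collapse spaces, dots, dashes and underscores into one '_'."""
    return re.sub(r"[\s.\-_]+", "_", name.strip().lower()).strip("_")


def resolve(name, cutoff=0.8):
    """Return (canonical_name, how); how is 'exact', 'alias', 'fuzzy' or 'unknown'."""
    key = normalize(name)
    if key in CANONICAL:
        return key, "exact"
    if key in ALIASES:
        return ALIASES[key], "alias"
    # Fall back to the closest canonical name, if one is similar enough.
    close = difflib.get_close_matches(key, CANONICAL, n=1, cutoff=cutoff)
    if close:
        return close[0], "fuzzy"
    return None, "unknown"


# Made-up header row from one report.
report = {col: resolve(col) for col in ["Order.Date", "Cust-ID", "Order Dates", "Region"]}
```

Returning how each name was resolved makes it easy to list only the 'fuzzy' and 'unknown' columns for manual review, and to add confirmed ones to the alias table so the next file resolves them automatically.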

Either way, this sounds like a pretty complex / nasty task to automate!

I've not seen any way to handle varying file metadata in Talend, although I'm happy to be corrected on this point!

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow