Question

I am familiar with automated software build tools (such as Automated Build Studio). Now I am looking at ETL tools.

One thing that crosses my mind is that I can do anything an ETL tool does by using a software build tool. ETL tools are tailored to loading and manipulating data, and a lot of scripting is needed to do that job. A software build tool, on the other hand, is versatile enough to do any job, including scripting the extraction, transformation, and loading of data from any format into any format.

Am I right?


Solution

It is correct that you can roll out your own ETL scripts written in a development tool of your preference. Having said that, ETL jobs are frequently large (for lack of a better word) and demand considerable administration and attention to minute detail. ETL tools let the developer focus on the ETL tasks themselves, as opposed to writing and debugging code, although that's part of it too. There are some open-source tools out there, so you can get a feel for what an average tool does before jumping into custom development. For example, more expensive tools provide data lineage, meaning you can (graphically) trace every field on a report back to the originating table through all transformations (versions included); after a corporate merger, that's quite a task to do by hand.
For example, Pentaho has a community edition; if you have MS SQL Server, you can get SSIS. Also see if you can find something here.
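
To make the lineage idea concrete, here is a toy sketch in Python (the table name, field names, and the TracedValue class are all invented for illustration) of what it means for a value to carry a record of where it came from:

```python
# Toy illustration of data lineage: each derived value carries a record
# of the source table/column and every transformation it passed through.
# Names are invented; real tools track this in metadata, not in the data.
class TracedValue:
    def __init__(self, value, lineage):
        self.value = value
        self.lineage = list(lineage)  # e.g. ["crm.customers.full_name"]

    def apply(self, func, step_name):
        """Transform the value and append the step to its history."""
        return TracedValue(func(self.value), self.lineage + [step_name])

name = TracedValue(" Ada Lovelace ", ["crm.customers.full_name"])
name = name.apply(str.strip, "trim_whitespace")
name = name.apply(str.upper, "upper_case")

print(name.value)    # ADA LOVELACE
print(name.lineage)  # ['crm.customers.full_name', 'trim_whitespace', 'upper_case']
```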

OTHER TIPS

The benefit of an ETL tool is maximized if you have many processes to build (I like jsf80238's analogy above about hammering in 100 nails). A key benefit of real ETL tools is the metadata they generate and the operational support they provide. Writing your scripts in Perl/Ruby/etc. is fairly easy, but breaks down when problems need to be tracked down or someone other than the author has to figure out what's wrong. The ability of admin/support staff to quickly see what went wrong is what's worth paying money for. I have used Microsoft's SSIS (2005 - OK) and the latest Pentaho PDI (quite good). The Pentaho ETL GUI is used by business users (without IT support 99% of the time) at my workplace, and has replaced a tangle of SQL scripts and spreadsheets. Say what you like about the rest of the Pentaho stack, but the ETL component is, in my opinion, excellent "bang for buck".
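
To illustrate the operational-support point: just giving support staff visibility into which run failed, after how many rows, and why, takes deliberate effort in a hand-rolled script. A minimal sketch in Python, where the job_runs audit table and its columns are assumptions for illustration:

```python
import logging
import sqlite3
import time

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")

def run_with_audit(conn, job_name, rows):
    # Per-run audit trail: the kind of metadata an ETL tool records
    # automatically. The table layout here is an assumption.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS job_runs "
        "(job TEXT, started REAL, finished REAL, rows_loaded INTEGER, status TEXT, error TEXT)"
    )
    started, loaded = time.time(), 0
    try:
        for row in rows:
            conn.execute("INSERT INTO target VALUES (?, ?)", row)
            loaded += 1
        conn.execute("INSERT INTO job_runs VALUES (?, ?, ?, ?, 'ok', NULL)",
                     (job_name, started, time.time(), loaded))
    except Exception as exc:
        # Record where the job died so support staff can see it without
        # reading the code.
        conn.execute("INSERT INTO job_runs VALUES (?, ?, ?, ?, 'failed', ?)",
                     (job_name, started, time.time(), loaded, str(exc)))
        logging.error("%s failed after %d rows: %s", job_name, loaded, exc)
        raise
    finally:
        conn.commit()

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE target (id INTEGER, name TEXT)")
    run_with_audit(conn, "demo_load", [(1, "a"), (2, "b")])
```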

The whole business of ETL is based on the premise that the source of the data is incompatible with the destination data source. And many times, the folks who dump the source data are not thinking about how that data will later need to be collected and aggregated. This is why the whole business of ETL exists.

A commercial ETL tool will not magically read the source input and transform the data according to the rules of the destination database. Rules have to be defined and fed into the ETL tool. Interestingly, many companies offer training on how to use their proprietary scripting languages! So it is not always that easy. But for non-programmers, this may be the preferred route.
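
The same applies to a hand-rolled script, of course: the source-to-destination rules have to be spelled out somewhere either way. A sketch of such a rule set in Python, with made-up field names:

```python
# Hypothetical source-to-destination mapping rules: the same kind of
# specification you would otherwise enter into an ETL tool's GUI or
# proprietary scripting language.
RULES = {
    "cust_id":   ("id",      int),
    "full_name": ("name",    str.strip),
    "ctry":      ("country", lambda v: v.strip().upper()),
}

def apply_rules(source_row):
    """Transform one source record into a destination record."""
    return {dest: convert(source_row[src]) for src, (dest, convert) in RULES.items()}

print(apply_rules({"cust_id": "42", "full_name": " Ada Lovelace ", "ctry": "uk"}))
# -> {'id': 42, 'name': 'Ada Lovelace', 'country': 'UK'}
```

Much of what a commercial tool sells is the machinery around rules like these: editing, validating, and documenting them.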

Personally, I think it is always easier to write an in-house ETL tool in a language like Perl: simply write a state-machine algorithm to rip through the source data and convert it to the desired format. I use Perl to FTP into machines, read in the files, transform the data, and then load it into the database. This is always a superior solution, and much faster, if one is proficient in Perl or something similar, or can hire someone who is.
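
A sketch of that fetch-transform-load pattern, in Python rather than Perl for illustration; the host, credentials, file names, and table layout are all placeholders:

```python
import csv
import sqlite3
from ftplib import FTP

# Sketch of the workflow described above: FTP a file down, rip through it
# line by line, and load it into a database. All names are placeholders.
def fetch(host, user, password, remote_path, local_path):
    with FTP(host) as ftp:
        ftp.login(user, password)
        with open(local_path, "wb") as f:
            ftp.retrbinary("RETR " + remote_path, f.write)

def transform_and_load(local_path, db_path):
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS orders (id INTEGER, amount REAL)")
    with open(local_path, newline="") as f:
        for row in csv.reader(f):
            # Transform: coerce types to what the destination table expects.
            conn.execute("INSERT INTO orders VALUES (?, ?)",
                         (int(row[0]), float(row[1])))
    conn.commit()
    conn.close()

if __name__ == "__main__":
    fetch("ftp.example.com", "user", "secret", "/exports/orders.csv", "orders.csv")
    transform_and_load("orders.csv", "warehouse.db")
```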

And one final point: start with the end in mind. Dump your source data in a structured format to help out the analysis group in your company that wants to aggregate and study the data. This will make the ETL program easier and faster to develop.
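
For instance, dumping the raw extract as JSON Lines (a format picked here purely for illustration) before any transformation gives the analysts something structured to work with and gives the ETL step a clean, predictable input:

```python
import json

# Hypothetical example: persist raw source records as JSON Lines before
# any transformation, one self-describing record per line.
records = [
    {"id": 1, "name": "Ada", "country": "UK"},
    {"id": 2, "name": "Grace", "country": "US"},
]

with open("source_dump.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")
```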

I like Damir Sudarevic's answer and wanted to add that your choice of tool might also depend on how much work you have in front of you. If you have the occasional ETL task and are already familiar with a tool that will allow you to accomplish that task, use the tool you already know (this approach assigns a zero value to learning a new tool, which is perhaps undervaluing new knowledge). If you have a lot of ETL tasks, the up-front investment of learning a new tool might very well pay off. You can use pliers to drive a nail, and if you have only one nail you can use the pliers. If you have to drive 100 nails get yourself a hammer.

You can also do anything ETL tools can do with code. :-)

Both tool categories you mention can be used to solve this problem, but they are optimized for the class of problems they are trying to solve:

  • ETL tools tend to come with a library of data manipulation operations (relational calculus, in-line computations, etc.), are optimized to handle large quantities of data, and have job management features (important if this isn't a single one-off data migration).
  • Build tools (for me, Ant comes to mind as a prototypical example) could do similar tasks, but are focused on compilation, file organization and manipulation, and packaging.
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow