Question

I have a product that lets users create information such as user details, employee details, etc. This web application is developed mainly in Spring and Hibernate.

Now that I am selling the product to a company, they are looking for bulk load tools to load, say, users and companies.

In this case, can I go for Spring Batch? (I have never used Spring Batch, but I have heard about it.) Since I am already using Spring in my application, I could reuse the same code and business logic implementation for bulk loading as well.

Or should I go for an ETL tool like Pentaho or Informatica? In that case, I would need to duplicate my code and business logic implementation in Pentaho or Informatica. If I change any logic in the core product, I would have to make the same change there as well.

Which approach is better?

My idea is to have an Excel file containing the list of users and companies. Spring Batch or Pentaho Kettle would take that as input, process the data, store it in the DB, and tell the user how many records were submitted, how many succeeded, and how many failed.

Please suggest which approach is good and why.
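To make the idea above concrete, here is a minimal plain-Java sketch of the read, validate, store, and report loop. It is not Spring Batch or Kettle code. It assumes the Excel sheet has been exported to CSV lines of the form `name,email`; the `validate` rule and the in-memory `store` list are hypothetical stand-ins for the product's real business logic and database.

```java
import java.util.ArrayList;
import java.util.List;

class BulkLoadSketch {

    /** Outcome of one bulk load run: the counts the question asks for. */
    static final class Summary {
        final int submitted;
        final int succeeded;
        final int failed;
        final List<String> errors;

        Summary(int submitted, int succeeded, List<String> errors) {
            this.submitted = submitted;
            this.succeeded = succeeded;
            this.failed = errors.size();
            this.errors = errors;
        }
    }

    /** Hypothetical validation rule; returns null when the row is acceptable. */
    static String validate(String name, String email) {
        if (name.isEmpty()) return "name is required";
        if (!email.contains("@")) return "email is invalid";
        return null;
    }

    /** Reads each CSV line, validates it, stores good rows, and records bad ones. */
    static Summary load(List<String> csvLines, List<String[]> store) {
        int succeeded = 0;
        List<String> errors = new ArrayList<>();
        for (int i = 0; i < csvLines.size(); i++) {
            // limit -1 keeps trailing empty fields such as "bob,"
            String[] fields = csvLines.get(i).split(",", -1);
            String name = fields.length > 0 ? fields[0].trim() : "";
            String email = fields.length > 1 ? fields[1].trim() : "";
            String problem = validate(name, email);
            if (problem == null) {
                store.add(new String[] {name, email}); // stand-in for a DB insert
                succeeded++;
            } else {
                errors.add("row " + (i + 1) + ": " + problem);
            }
        }
        return new Summary(csvLines.size(), succeeded, errors);
    }

    public static void main(String[] args) {
        List<String[]> store = new ArrayList<>();
        Summary s = load(List.of("alice,alice@example.com", "bob,", ",carol@example.com"), store);
        System.out.println("submitted=" + s.submitted
                + " succeeded=" + s.succeeded + " failed=" + s.failed);
        s.errors.forEach(System.out::println);
    }
}
```

Whichever tool is chosen, a bad row is recorded and skipped rather than aborting the whole file, which is what makes the submitted/succeeded/failed report possible.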


Solution 2

Having tried both technologies, IMHO Pentaho ETL will be much faster to build with, as you just drag and drop steps and configure your input, output, and processing. I believe ETL would also require less training than Spring Batch. I'm a Java developer myself. I used Pentaho Kettle (an ETL tool) for a similar requirement some time back, and I'm now working with Spring Batch on a similar task: what took about 10 minutes in Kettle takes a number of hours in Spring Batch, bearing in mind that I was new to both technologies when I started.

OTHER TIPS

I am using Spring Batch on the job and I have no experience with any ETL tools, so I am biased towards that. However, I think you pretty much answered your own question.

You mention that Spring Batch will allow you to reuse your existing business logic (this alone is good enough for me) and to obtain summary statistics (Spring Batch has this functionality by default). It's also my opinion that it will be much easier to find, hire, and train Java developers than developers for proprietary ETL software.

The only downside is that you may need to extend the framework for it to be useful. For example, if you are receiving JSON, the framework doesn't support that at the time of writing.
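The reuse argument can be sketched in plain Java (no Spring classes). The rule lives in one method, and both the interactive path and the bulk path call it, so a change to the rule reaches both. All names here (`checkUser`, `createFromForm`, `createFromRows`) are hypothetical and only illustrate the shape; the `db` list stands in for the real persistence layer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

class SharedRulesSketch {

    /** The business rule, written once. Returns a message when the user is invalid. */
    static Optional<String> checkUser(String name, String email) {
        if (name == null || name.isEmpty()) return Optional.of("name is required");
        if (email == null || !email.contains("@")) return Optional.of("email is invalid");
        return Optional.empty();
    }

    /** Interactive path (e.g. a web form): reject immediately with an exception. */
    static void createFromForm(String name, String email, List<String> db) {
        checkUser(name, email).ifPresent(msg -> {
            throw new IllegalArgumentException(msg);
        });
        db.add(name);
    }

    /** Bulk path: apply the same rule, but collect rejections and keep going. */
    static List<String> createFromRows(List<String[]> rows, List<String> db) {
        List<String> rejected = new ArrayList<>();
        for (String[] row : rows) {
            Optional<String> problem = checkUser(row[0], row[1]);
            if (problem.isPresent()) {
                rejected.add(row[0] + ": " + problem.get());
            } else {
                db.add(row[0]);
            }
        }
        return rejected;
    }
}
```

With an external ETL tool, the equivalent of `checkUser` would typically have to be re-expressed inside the tool, which is the duplication the question is worried about.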

Pasting some good links here that might be helpful for others:

http://www.coderanch.com/t/579152/Spring/Spring-Batch-ETL

Spring Batch will not do the parsing for you. You will need to receive the files, process them, validate them, etc. Also look into Mule ESB for automatic triggering when files arrive in certain folders/directories.

Also, for ETL, look at Talend. I believe it's open source and can transform all sorts of files.

http://forum.spring.io/forum/spring-projects/batch/62803-batch-vs-etl

That's a pretty big question, one I've had long and protracted discussions about before, and there isn't a hard and fast rule. I don't claim to be an ETL expert, but I've had some familiarity with the big guns in the ETL space, such as Datastage. While it's easy to agree that in many ways Java batch processing is similar to ETL (your assertion that ETL is similar to Read/Process/Write is reasonable), I see ETL generally used in BI scenarios. In fact, if you look at the Jasper site, ETL is a component of their full BI stack, and the same is true of many other ETL providers. I see it used a lot in data warehousing scenarios, and it works quite well there. Bulk moving and transformation of data is where it shines.

Where I've seen issues is when trying to apply complex business logic in between. I don't want to start any kind of religious debate here; this has just been my experience. ETL tools are just that, tools. It almost boils down to packaged vs. custom in some ways, which is a debate I don't want to get into at all. However, if you have a company full of Java developers, and much of the business logic is already written in Java for other application styles such as web or integration, it makes a lot of sense to keep the batch application style in the same technology.

ETL tools have come a long way in terms of usability, but they're still fairly large and complex tools, and learning to use them effectively requires some time. I realize that the time to learn Spring Batch isn't exactly zero, but I think it's fairly easy to agree that getting a Java person up to speed on a Java framework is going to go better than teaching them to use a tool; we tend to like to code. The cost issue often comes up as well, since ETL is generally not free. I know there are some open source implementations out there, some in Java, but I haven't had experience using them in large production environments, so I can't comment.

That's about as far as I'm willing to go in a forum post. I think ETL is certainly another tool in the toolbox, which in certain scenarios may overlap with a custom batch solution. The decision on which to use depends upon a lot of factors about your particular scenario.

Licensed under: CC-BY-SA with attribution