Stored procedures vs JDO for data warehousing project
In the old days we used to access the database through stored procedures. They were seen as "the better" way of managing the data: the data stays in the database, and any language or platform can access it through JDBC, ODBC, etc.
However, in recent years storage and retrieval mechanisms based on run-time reflection and metadata, such as Hibernate and DataNucleus, have become popular. Initially we were worried that they'd be slow because of the extra steps involved (reflection is expensive) and because they retrieve unnecessary data (the whole object) when all we need is one field.
I'm starting to plan for a large data warehousing project that uses J2EE, but I'm a bit unsure whether to go for Stored Procedures or JDO/JPA and the like. Recently, I've been working with Hibernate, and to be quite honest, I don't miss writing CRUD stored procedures!
It essentially boils down to:
Stored procedures
+ Can be optimised on the server (although only the queries)
- There's likely to be more than a thousand stored procedures: add, delete, update, getById, etc., for each table.
JDO
+ I won't spend the next few months writing parameters.add("@firstNames", customer.getFirstName()); ...
- Will be slower than SPs (but most support paging)
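To make the trade-off in the list above concrete, here is a minimal sketch of the idea behind metadata-driven tools: the getters of a bean are discovered by reflection and turned into named parameters, instead of one hand-written parameters.add(...) call per field. `Customer`, `ParamBinder` and the "@name" convention are hypothetical names used only for illustration; real tools such as Hibernate or DataNucleus do far more (metadata caching, type mapping, lazy loading).

```java
import java.lang.reflect.Method;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical bean standing in for any persistent class.
class Customer {
    private final String firstName;
    private final String lastName;

    Customer(String firstName, String lastName) {
        this.firstName = firstName;
        this.lastName = lastName;
    }

    public String getFirstName() { return firstName; }
    public String getLastName() { return lastName; }
}

// Hypothetical helper: builds "@property" -> value pairs from public getters.
class ParamBinder {
    static Map<String, Object> bind(Object bean) {
        // TreeMap keeps the parameters in a stable, sorted order.
        Map<String, Object> params = new TreeMap<>();
        for (Method m : bean.getClass().getMethods()) {
            String name = m.getName();
            boolean isGetter = name.startsWith("get")
                    && name.length() > 3
                    && m.getParameterCount() == 0
                    && !name.equals("getClass"); // inherited from Object, not a field
            if (!isGetter) {
                continue;
            }
            // getFirstName -> firstName
            String property = Character.toLowerCase(name.charAt(3)) + name.substring(4);
            try {
                params.put("@" + property, m.invoke(bean));
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("Cannot read property " + property, e);
            }
        }
        return params;
    }
}
```

Calling ParamBinder.bind(new Customer("Ada", "Lovelace")) yields a map with the keys @firstName and @lastName. The reflective lookup is the "extra step" the question worries about, which is why real tools typically resolve this metadata once and reuse it.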
What would you plump for in my situation? In this case I think it's much of a muchness.
No correct solution
"JDO - Will be slower than SPs (but most support paging)"
This assumption is often false. There's no reason for SPs to be particularly fast. I've done some measurements and they're no faster than code outside the database.
A data warehouse is characterized by insert-only loads and long-running SELECT ... GROUP BY ... queries.
You're not writing an OLTP application. You're not using 3NF as a way to prevent update anomalies on update/delete transactions.
Since you're doing bulk inserts, an SP will definitely be slower than a bulk load utility. Bulk loaders are often multi-threaded and will consume all available CPU resources. The SP is part of the DB and can only share limited DB resources.
Since you're mostly doing SELECT ... GROUP BY, an SP won't help much here, either. The SELECT statement doesn't benefit from being wrapped in a procedure.
You don't need them. They don't help.
You can easily benchmark a bulk load and a query to demonstrate that SPs aren't helping.
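A minimal sketch of such a benchmark harness follows. `Bench` and `medianNanos` are hypothetical names, and the tasks you pass in are up to you: for a real comparison they would be, say, one run of your bulk loader and one call to the equivalent stored procedure over the same data. The warm-up runs are there so that JIT compilation and cold caches don't dominate the timings.

```java
import java.util.Arrays;

// Hypothetical timing helper for comparing two ways of doing the same work.
class Bench {
    // Runs the task `warmups` times untimed, then `runs` times timed,
    // and returns the median elapsed time in nanoseconds
    // (the upper median when `runs` is even).
    static long medianNanos(Runnable task, int warmups, int runs) {
        if (runs < 1) {
            throw new IllegalArgumentException("runs must be >= 1");
        }
        for (int i = 0; i < warmups; i++) {
            task.run();
        }
        long[] samples = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            samples[i] = System.nanoTime() - start;
        }
        Arrays.sort(samples);
        return samples[runs / 2];
    }
}
```

Measure both approaches with the same warm-up and run counts, against the same data volume, and compare the medians rather than a single run.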
Rod Johnson, in his "J2EE Design and Development", wrote a very clear analysis of ORM versus stored procedures. He said that:
"Stored procedures should only be used in a J2EE system to perform operations that will always use the database heavily, whether they're implemented in the database or in Java code that exchanges a lot of data with the database."
As you're planning to implement a data warehouse, I think that the stored procedures approach is the right choice.
I would suggest using the metadata to generate the scripts you use for loading into the data warehouse. This allows you to get performance benefits from using specialised load tools and perhaps from stored procedures (if you're using a sufficiently ancient database). Also, you will probably end up hand coding at least some SQL. Having your generic scripts done as stored procs will allow you to schedule all of them in the same way and not have to worry about changing how they are invoked when you rewrite some generated code to make it run better.
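As a rough sketch of what generating load scripts from metadata could look like, the following builds a plain INSERT ... SELECT statement from a table name and a column list. `LoadScriptGenerator`, the staging-table approach and the table names in the usage note are all assumptions made for illustration; a real generator would read the column list from your schema metadata and would probably emit control files for your database's own load tool instead.

```java
import java.util.List;

// Hypothetical generator: turns table metadata into a load statement.
class LoadScriptGenerator {
    // Builds an INSERT ... SELECT that copies the listed columns
    // from a staging table into the warehouse table.
    static String insertSelect(String target, String staging, List<String> columns) {
        if (columns.isEmpty()) {
            throw new IllegalArgumentException("at least one column is required");
        }
        String cols = String.join(", ", columns);
        return "INSERT INTO " + target + " (" + cols + ")\n"
             + "SELECT " + cols + "\n"
             + "FROM " + staging + ";";
    }
}
```

For example, insertSelect("fact_sales", "stg_sales", Arrays.asList("sale_id", "amount")) produces a three-line statement copying both columns from stg_sales into fact_sales. Because the names are concatenated directly into SQL, they should only ever come from trusted metadata, never from user input.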
As for getting the data out, if what you're building in J2EE is a reporting tool, then you may be better off using JDO. While I'm not terribly familiar with the reporting side of things, one benefit I can see is that it will be easier to allow your end users to make custom reports that you did not anticipate in advance (although you've still got to have some limits on what they can do so that they don't take down the database in the process).