Question

I've got a lot of MySQL data that I need to generate reports from. It's mostly historical data, so it won't be changing much, but it weighs in at 20-30 gigabytes easily and is expected to grow. I currently have a collection of PHP scripts that run some complex queries and output CSV and Excel files. I also use phpMyAdmin with bookmarked queries, which I manually edit to change the parameters. The amount of data is growing, and the number of people who need access to it is growing too, so I'm making the time to improve this situation.

I started reading about data warehousing the other day, and it seems that this is an area that relates to what I need to do. I've read some good articles and am even waiting on a book. I think I'm getting a handle on what these sorts of systems do and what's possible.

Creating a reporting system for my data has always been on my todo list, but until recently I figured it would be a highly niche programming venture. Since I now know data warehousing is a common thing, I figure there must be some sort of reporting/warehousing frameworks available to ease development. I'd gladly skip writing interfaces and scripts to schedule and email reports and the like, and stick to writing queries and setting up relations.

I've mostly been a LAMP guy, but I'm not above switching languages or platforms. I just need a more robust solution, as my one-off scripts don't scale well.

So where's a good place to get started?


Solution

I'll discuss a few points on the {budget, business utility function, time frame} spectrum out there. For convenience, let's follow the architecture conceptualization you linked to at

    Wikipedia's Data Warehouse article

  • Operational database layer
    The source data for the data warehouse, normalized so that data is maintained in one place only

  • Data access layer
    The transformation of your source data into your informational access layer.
    ETL tools to extract, transform, load data into the warehouse fall into this layer.

  • Informational access layer
      • Report-facilitating Data Structure
          Data is not maintained here; it is merely a reflection of your source data.
          Hence, denormalized structures (containing duplicate, but systematically derived, data)
          are usually most effective here
      • Reporting tools
          How do you actually allow your users access to the data
          • pre-canned reports (simple)
          • more dynamic slice-and-dice access methods

        The data accessed for reporting and analyzing and the tools for reporting and analyzing data
        fall into this layer. And the Inmon-Kimball differences about design methodology,
        discussed later in the Wikipedia article, have to do with this layer.

  • Metadata layer (facilitates automation, organization, etc)
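To make the informational access layer concrete: it is usually built around denormalized structures such as a star schema, where a central fact table of measurable events joins to descriptive dimension tables. A minimal sketch using Python with SQLite (all table and column names here are illustrative assumptions, not from the original post):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension table: descriptive attributes, denormalized for reporting
cur.execute("""
    CREATE TABLE dim_customer (
        customer_key INTEGER PRIMARY KEY,
        name TEXT,
        region TEXT
    )""")

# Fact table: one row per measurable event, keyed to the dimensions
cur.execute("""
    CREATE TABLE fact_sales (
        customer_key INTEGER REFERENCES dim_customer(customer_key),
        sale_date TEXT,
        amount REAL
    )""")

cur.execute("INSERT INTO dim_customer VALUES (1, 'Acme', 'East')")
cur.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                [(1, '2009-01-01', 100.0), (1, '2009-01-02', 50.0)])

# Typical slice-and-dice query: aggregate facts by a dimension attribute
rows = cur.execute("""
    SELECT d.region, SUM(f.amount)
    FROM fact_sales f JOIN dim_customer d USING (customer_key)
    GROUP BY d.region""").fetchall()
print(rows)  # [('East', 150.0)]
```

The same query against a fully normalized operational schema would typically need several more joins; the denormalized layout is what makes reporting queries simple and fast.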

Roll your own (low-end)
For very little out-of-pocket cost, simply recognizing the need for denormalized structures can buy some efficiencies for those who aren't using them yet.
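The cheapest version of "roll your own" is a summary table that is rebuilt on a schedule, so reports read a few pre-aggregated rows instead of scanning tens of gigabytes. A hedged sketch, again with SQLite standing in for MySQL (the `orders`/`monthly_sales` names are assumptions for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Normalized operational table: the "one place only" source of truth
cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, placed_on TEXT, total REAL)")
cur.executemany("INSERT INTO orders (placed_on, total) VALUES (?, ?)",
                [('2009-01-15', 10.0), ('2009-01-20', 20.0), ('2009-02-01', 5.0)])

# Denormalized summary table, rebuilt on a schedule (e.g. nightly cron)
# rather than aggregated live on every report request
cur.execute("""
    CREATE TABLE monthly_sales AS
    SELECT substr(placed_on, 1, 7) AS month, SUM(total) AS total
    FROM orders
    GROUP BY month""")

rows = cur.execute("SELECT month, total FROM monthly_sales ORDER BY month").fetchall()
print(rows)  # [('2009-01', 30.0), ('2009-02', 5.0)]
```

In MySQL the rebuild would typically be a `CREATE TABLE ... SELECT` or `INSERT ... SELECT` run from cron, which fits the mostly-historical, rarely-changing data described in the question.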

Get in the ballgame (some outlays required)
You don't need to use all the functionality of a platform right off the bat.
IMO, however, you want to be on a platform that you know will grow, and in the highly competitive and consolidating BI environment, that seems to be one of the four enterprise mega-vendors (my opinion)

  • Microsoft (the platform of our 110 employee firm)
  • SAP
  • Oracle
  • IBM

    (see an article on the state of the BI market)

My firm is at this stage. In the "Data Access Layer" we use some of the ETL capability offered by SQL Server Integration Services (SSIS), along with some alternate usage of the open-source (but in practice license-requiring) Talend product; a denormalized reporting structure (implemented entirely in the basic SQL Server database); and SQL Server Reporting Services (SSRS) to largely automate (depending on your skill) the production of pre-specified reports. Note that an SSRS "report" is merely a (scalable) XML configuration/specification that gets rendered at runtime by the SSRS engine. Choices such as exporting to an Excel file are simple options.

Serious Commitment (some significant human commitment required)
Notice above that we have yet to utilize the data mining/dynamic slicing/dicing capabilities of SQL Server Analysis Services. We are working toward that, but are currently focused on improving the quality of our data cleansing in the "Data Access Layer".

I hope this helps you to get a sense of where to start looking.

OTHER TIPS

Pentaho has put together a pretty comprehensive suite of products. The products are "free", but be prepared for the usual heavy sell once you fork over your identifying information.

I haven't had a chance to really stretch them as we're a Microsoft shop from one sad end to the other.

I think you should first check out Kimball and Inmon and see if you want to approach your data warehouse in a particular way. Kimball, in particular, lays out a very good framework for the modelling and construction of the warehouse.

There are a number of tools which try to ease the process of designing, implementing and managing/operating a Data Warehouse. Each has its strengths and weaknesses, and often vastly differing price points. Under the covers, you are always going to be best off if you have a good knowledge of warehousing principles from the Kimball and/or Inmon camps.

As well as tools like Kalido and Wherescape RED (which do similar things in very different ways), many of the ETL platforms now have good built-in support for the donkey work of implementation: SCD components, lineage tracking, and the like.
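To give a sense of the donkey work those SCD components automate: a Type 2 slowly changing dimension keeps full history by expiring the current row and inserting a new one whenever a tracked attribute changes. A heavily simplified sketch (the schema and helper function are illustrative assumptions, not any particular tool's implementation):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Type 2 slowly changing dimension: one row per version of the record
cur.execute("""
    CREATE TABLE dim_customer (
        customer_id INTEGER,   -- natural key from the source system
        region TEXT,
        valid_from TEXT,
        valid_to TEXT,         -- NULL means "still current"
        is_current INTEGER
    )""")
cur.execute("INSERT INTO dim_customer VALUES (1, 'East', '2009-01-01', NULL, 1)")

def scd2_update(customer_id, new_region, today):
    """Expire the current row and insert a new version when the attribute changes."""
    row = cur.execute(
        "SELECT region FROM dim_customer WHERE customer_id=? AND is_current=1",
        (customer_id,)).fetchone()
    if row and row[0] != new_region:
        cur.execute("""UPDATE dim_customer SET valid_to=?, is_current=0
                       WHERE customer_id=? AND is_current=1""",
                    (today, customer_id))
        cur.execute("INSERT INTO dim_customer VALUES (?, ?, ?, NULL, 1)",
                    (customer_id, new_region, today))

scd2_update(1, 'West', '2009-06-01')
history = cur.execute("""SELECT region, valid_from, valid_to
                         FROM dim_customer ORDER BY valid_from""").fetchall()
print(history)  # [('East', '2009-01-01', '2009-06-01'), ('West', '2009-06-01', None)]
```

Facts recorded before June 2009 keep pointing at the 'East' version of the row, which is exactly the history-preserving behavior the ETL components package up for you.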

Best, though, to view all of these as tools to be used in your hands, the craftsman's: they make certain easy things even easier (or even trivial) and some hard things easier, but some things they just get in the way of, IMHO ;) Learn the methodology and principles first, get a good understanding of them, and then you will know which tools to apply from your kitbag, and when...

It hasn't been updated in a while but there's a nice Data Warehousing/ETL Ruby package called ActiveWarehouse.

But I would check out the Pentaho products like Nick mentioned in another answer. It should easily handle the volume of data you have and may provide you with more ways to slice and dice your data than you could have ever imagined.

The best framework you can currently get is Anchor Modeling.
It might look quite complex because of its generic structure and built-in capability to historize data.
The modeling technique is also quite different from ERD.
But you end up with SQL code to generate all DB objects, including 3NF views, and:

  • inserts/updates handled by triggers
  • queries against any point or range in history
  • your application developers will not see the underlying 6NF anchor model

The technology is open source and at the moment is unbeatable.

If you have Anchor Modeling questions, you may want to ask on that tag.

Kimball is the simpler method for data warehousing.

We use Informatica for moving data around, but it doesn't do DW things like indexing by default.
I like the idea of Wherescape RED as a DW tool, and of using MS SQL's Linked Servers to obviate the need for an ETL tool.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow