Question

I have a large amount of data I need to store, and be able to generate reports on - each one representing an event on a website (we're talking over 50 per second, so clearly older data will need to be aggregated).

I'm evaluating approaches to implementing this, obviously it needs to be reliable, and should be as easy to scale as possible. It should also be possible to generate reports from the data in a flexible and efficient way.

I'm hoping that some SOers have experience of such software and can make a recommendation, and/or point out the pitfalls.

Ideally I'd like to deploy this on EC2.

Was it helpful?

Solution

Wow. You are opening up a huge topic.

A few things right off the top of my head...

  1. think carefully about your schema for inserts in the transactional part and reads in the reporting part, you may be best off keeping them separate if you have really large data volumes
  2. look carefully at the latency that you can tolerate between real-time reporting on your transactions and aggregated reporting on your historical data. Maybe you should have a process which runs periodically and aggregates your transactions.
  3. look carefully at any requirement which sees you reporting across your transactional and aggregated data, either in the same report or as a drill-down from one to the other
  4. prototype with some meaningful queries and some realistic data volumes
  5. get yourself a real production quality, enterprise ready database, i.e. Oracle / MSSQL
  6. think about using someone else's code/product for the reporting e.g. Crystal/BO / Cognos

as I say, huge topic. As I think of more I'll continue adding to my list.

HTH and good luck

OTHER TIPS

@Simon made a lot of excellent points, I'll just add a few and re-iterate/emphasize some others:

  1. Use the right datatype for the Timestamps - make sure the DBMS has the appropriate precision.
  2. Consider queueing for the capture of events, allowing for multiple threads/processes to handle the actual storage of the events.
  3. Separate the schemas for your transactional and data warehouse
  4. Seriously consider a periodic ETL from transactional db to the data warehouse.
  5. Remember that you probably won't have 50 transactions/second 24x7x365 - peak transactions vs. average transactions
  6. Investigate partitioning tables in the DBMS. Oracle and MSSQL will both partition on a value (like date/time).
  7. Have an archiving/data retention policy from the outset. Too many projects just start recording data with no plans in place to remove/archive it.

Im suprised none of the answers here cover Hadoop and HDFS - I would suggest that is because SO is a programmers qa and your question is in fact a data science question.

If youre dealing with a large number of queries and large processing time, you would use HDFS (a distributed storage format on EC) to store your data and run batch queries (I.e. analytics) on commodity hardware.

You would then provision as many EC2 instances as needed (hundreds or thousands depending on how big your data crunching requirements are) and run map reduce queires against.your data to produce reports.

Wow.. This is a huge topic.

Let me begin with databases. First get something good if you are going to have crazy amounts to data. I like Oracle and Teradata.

Second, there is a definitive difference between recording transactional data and reporting/analytics. Put your transactional data in one area and then roll it up on a regular schedule into a reporting area (schema).

I believe you can approach this two ways

  • Throw money at the problem: Buy best in class software (databases, reporting software) and hire a few slick tech people to help

  • Take the homegrown approach: Build only what you need right now and grow the whole thing organically. Start with a simple database and build a web reporting framework. There are a lot of descent open-source tools and inexpensive agencies that do this work.

As far as the EC2 approach.. I'm not sure how this would fit into a data storage strategy. The processing is limited which is where EC2 is strong. Your primary goal is effecient storage and retreival.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top