Question

I am building a system to analyze large quantities of financial data, specifically securities trading prices. A major challenge is choosing a storage method, given that the data will run into the tens of terabytes. The data will be queried heavily: averages, standard deviations, and sums, filtered by multiple columns such as price, time, and volume. Join statements aren't a requirement, but would be nice to have.
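For concreteness, the query shapes I have in mind look roughly like the sketch below. Python's built-in sqlite3 stands in purely as a placeholder SQL engine, and the `trades` table and its columns are made up:

```python
# Sketch of the query shapes in question; sqlite3 is only a stand-in
# engine, and the table/column names ("trades", "price", "volume",
# "trade_time") are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE trades (
        trade_time TEXT,
        symbol     TEXT,
        price      REAL,
        volume     INTEGER
    )
""")
conn.executemany(
    "INSERT INTO trades VALUES (?, ?, ?, ?)",
    [("2011-06-01T09:30:00", "XYZ", 10.25, 500),
     ("2011-06-01T09:30:01", "XYZ", 10.27, 300)],
)

# Aggregates filtered by several columns: average price, total volume,
# and a population standard deviation derived from SUM/COUNT, since
# SQLite lacks a built-in STDDEV (the column stores under evaluation
# provide one natively).
row = conn.execute("""
    SELECT AVG(price),
           SUM(volume),
           COUNT(*),
           SUM(price * price)
    FROM trades
    WHERE symbol = 'XYZ'
      AND volume > 100
      AND trade_time >= '2011-06-01T09:30:00'
""").fetchone()
avg_price, total_volume, n, sum_sq = row
stddev = (sum_sq / n - avg_price ** 2) ** 0.5
print(avg_price, total_volume, stddev)
```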

Right now, I am evaluating Infobright Community Edition, MonetDB, and Greenplum Community Edition. They seem great so far, but some of the more advanced features I need (running across multiple servers, insert/update statements, etc.) are not available in some of these editions.

What solutions would you use in this situation, and what benefits do they provide over the alternatives? Cost-effectiveness is a major plus. If I must pay for a data warehousing solution I will, but I would much rather take the open-source/community-edition route if possible.

Solution

Infobright delivers fast query performance on large volumes of data with no tuning, no projections, and no indexes. On data loading, I have seen instances where 80TB of data per hour can be loaded, with over 12,000 inserts per second.

How does it work?

  1. Column orientation instead of row orientation
  2. Data Packs plus compression that averages 20:1
  3. Knowledge Grid, giving sub-second query response
  4. Granular engine built on top of the MySQL architecture, so standard MySQL clients and loaders work against it (see the loading sketch below)
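Because of item 4, ICE speaks the ordinary MySQL wire protocol, and bulk loading goes through LOAD DATA INFILE rather than row-by-row INSERTs (which the community edition does not support). A minimal loading sketch, assuming a local server and a pre-created `trades` table; host, credentials, database, and file path are all placeholders:

```python
# Hypothetical bulk load into Infobright CE. Since ICE is built on
# MySQL, any MySQL driver works; mysql.connector is used here.
import mysql.connector

conn = mysql.connector.connect(
    host="localhost",
    user="analyst",
    password="secret",
    database="marketdata",
)
cur = conn.cursor()

# ICE takes bulk loads via LOAD DATA INFILE; per-row INSERT/UPDATE
# is an enterprise-edition feature.
cur.execute("""
    LOAD DATA INFILE '/data/trades_2011-06-01.csv'
    INTO TABLE trades
    FIELDS TERMINATED BY ',' ENCLOSED BY '"'
    LINES TERMINATED BY '\\n'
""")
conn.commit()
cur.close()
conn.close()
```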

I would still suggest you consider the enterprise licensing, but you can certainly evaluate the community edition and test your performance and data-loading needs against it.

Disclaimer: author is affiliated with Infobright.

OTHER TIPS

I think any of the databases you mention will do what you require. If you're dealing with tens of TB of data, an enterprise license to run in an MPP cluster would probably be a good use of funds to keep processing time down. Also, if this DW will be doing important processing for your organization, a license means you get support from the vendor, which matters to a lot of businesses. YMMV.

A more important question is what your data ingest rates will look like. For a financial system, a large part of the equation should be the ability to keep loading fresh data into the system while continuing to do your normal processing.
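One cheap way to make that criterion concrete during evaluation is to time batched loads against each candidate and compute rows per second. A toy harness along these lines (again with sqlite3 as a stand-in engine) ports directly to any DB-API driver:

```python
# Toy ingest benchmark: time a batch load and report rows/second.
# sqlite3 is a stand-in; swap in the DB-API driver for whichever
# engine is under evaluation.
import sqlite3
import time

def benchmark_ingest(conn, rows):
    cur = conn.cursor()
    start = time.time()
    cur.executemany("INSERT INTO trades VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows) / (time.time() - start)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trades (trade_time TEXT, symbol TEXT, "
    "price REAL, volume INTEGER)"
)
batch = [("2011-06-01T09:30:00", "XYZ", 10.25, 500)] * 100_000
print(f"{benchmark_ingest(conn, batch):,.0f} rows/sec")
```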

Of your list of candidates I am only familiar with Greenplum, but I know it does well at loading large amounts of data in a short amount of time. GP also has a lot of statistical and analytic functions you can run natively inside the DB, including built-in SQL functions, MADlib, R, etc.
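As an illustration of that combination, Greenplum can pull flat files in parallel through gpfdist-backed external tables and then run statistics in place. A hedged sketch via psycopg2, where the connection string, gpfdist URL, and table and column names are placeholders to adapt:

```python
# Sketch of Greenplum's load-then-analyze flow. All names are
# hypothetical; the MADlib linregr_train call follows the
# signature documented by the MADlib project.
import psycopg2

conn = psycopg2.connect("dbname=marketdata user=analyst host=gp-master")
cur = conn.cursor()

# Parallel load: every segment reads from the gpfdist file server
# at once, which is how GP sustains high ingest rates.
cur.execute("""
    CREATE EXTERNAL TABLE trades_ext (LIKE trades)
    LOCATION ('gpfdist://etl-host:8081/trades_*.csv')
    FORMAT 'CSV'
""")
cur.execute("INSERT INTO trades SELECT * FROM trades_ext")

# In-database analytics: built-in SQL aggregates...
cur.execute("""
    SELECT symbol, avg(price), stddev(price), sum(volume)
    FROM trades
    GROUP BY symbol
""")

# ...or MADlib, e.g. a linear regression of price on volume,
# without the data ever leaving the cluster.
cur.execute("""
    SELECT madlib.linregr_train(
        'trades', 'price_model', 'price', 'ARRAY[1, volume]')
""")
conn.commit()
```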

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow