Question

We are currently facing the problem of how to effectively store and retrieve data from very large data sets (into the billions). We have been using MySQL and have optimized the system, OS, RAID, queries, indexes, etc., and are now looking to move on.

I need to make an informed decision about which technology to pursue to solve our data problems. I have been investigating MapReduce with HDFS, but have also heard good things about HBase. I can't help but think there are other options as well. Is there a good comparison of the available technologies and the trade-offs of each?

If you have links to share on each, I would appreciate that as well.


Solution

This is a broad issue, so I will try to give directions; for each one you can look, or ask, for further information.

The first option is conventional RDBMSs. If the data is valuable enough to justify RAID arrays and a good server, Oracle might be a good, but expensive, solution. TPC-H is the industry-standard benchmark for decision-support queries; the top performance results are listed at http://www.tpc.org/tpch/results/tpch_perf_results.asp. As you can see there, RDBMSs can scale to terabytes of data.
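At billions of rows, one RDBMS technique that usually matters is table partitioning, so the optimizer can prune whole partitions instead of scanning everything. A minimal sketch in MySQL syntax, since that is what you are coming from (the `events` table and its columns are hypothetical):

```sql
-- Hypothetical events table, range-partitioned by year.
-- MySQL requires the partition key to be part of every unique key,
-- hence the composite primary key.
CREATE TABLE events (
    id         BIGINT NOT NULL,
    created_at DATE   NOT NULL,
    payload    VARCHAR(255),
    PRIMARY KEY (id, created_at)
)
PARTITION BY RANGE (YEAR(created_at)) (
    PARTITION p2008 VALUES LESS THAN (2009),
    PARTITION p2009 VALUES LESS THAN (2010),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);

-- A query with a predicate on created_at only touches matching partitions.
SELECT COUNT(*) FROM events
WHERE created_at BETWEEN '2009-01-01' AND '2009-12-31';
```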
The second option is Hadoop, in the form of HDFS + MapReduce + Hive. Hive is a data-warehousing solution built on top of MapReduce. You get some additional benefits, such as the ability to store data in its original format and to scale linearly. Things you will want to look at are indexing and support for very complex queries.
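To make the "original format" point concrete: Hive can declare a schema over files already sitting in HDFS and compile SQL-like queries into MapReduce jobs. A sketch, assuming hypothetical tab-delimited log files under `/data/page_views`:

```sql
-- Hypothetical: raw tab-delimited logs stay where they are on HDFS;
-- EXTERNAL means Hive does not copy or take ownership of the files.
CREATE EXTERNAL TABLE page_views (
    user_id   BIGINT,
    url       STRING,
    view_time STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/page_views';

-- Hive compiles this into one or more MapReduce jobs.
SELECT url, COUNT(*) AS hits
FROM page_views
GROUP BY url
ORDER BY hits DESC
LIMIT 10;
```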
The third option is MPP: massively parallel processing databases. They scale from dozens to hundreds of nodes and have rich SQL support. Examples are Netezza, Greenplum, Aster Data, and Vertica. Choosing among them is not a simple task, but with more precise requirements it can be done.
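In MPP systems, the key schema decision is the distribution key, which determines how rows are hashed across nodes; co-locating tables on their join key lets joins run locally without shuffling data between nodes. A sketch in Greenplum-style syntax (table and column names are hypothetical):

```sql
-- Rows are hash-distributed across segments by user_id, so a join
-- between these two tables on user_id can run segment-locally.
CREATE TABLE users (
    user_id BIGINT,
    country TEXT
) DISTRIBUTED BY (user_id);

CREATE TABLE page_views (
    user_id   BIGINT,
    url       TEXT,
    view_time TIMESTAMP
) DISTRIBUTED BY (user_id);
```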

Licensed under: CC-BY-SA with attribution