Question

I am asking this in the context of NoSQL - which achieves scalability and performance without being expensive.

So, if I needed to achieve massively parallel distributed computing across databases ... What are the various methodologies available today (within the RDBMS paradigm) to achieve distributed computing with high-scalability?

Does database clustering & mirroring contribute in any way towards distributed computing?


Solution

I guess you are asking about the scalability of RDBMS databases. NoSQL databases based on Amazon Dynamo or BigTable, such as HBase and Cassandra, are a whole other topic. There are also commercial products like Oracle Coherence, which is more like a distributed cache and key-value store, to put it crudely.

Going back to RDBMS:

Sharding: to scale an RDBMS, you can do custom sharding. Sharding is a technique where you have multiple tables, possibly on multiple hosts, and you decide by some fixed rule which rows go to which table. For example, rows 1-1M go to table1, rows 1M-2M go to table2, and so on. This is a difficult process from an administration point of view, but a lot of large-scale websites scale by relying on sharding. Other techniques worth mentioning are partitioning, MySQL federation, and MySQL Cluster.
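The range rule in the example above can be sketched as a small routing function. This is only an illustration: the name shard_for_row, the fixed shard size, and the choice to put row 1,000,000 in table1 and row 1,000,001 in table2 are assumptions, since the answer leaves the exact boundaries open.

```python
ROWS_PER_SHARD = 1_000_000  # assumed shard size, taken from the 1M example


def shard_for_row(row_id: int, shard_count: int) -> str:
    """Return the (hypothetical) table name that holds row_id.

    Rows 1..1,000,000 map to table1, rows 1,000,001..2,000,000 to table2,
    and so on. The boundary handling is an assumption.
    """
    if row_id < 1:
        raise ValueError("row ids are assumed to start at 1")
    index = (row_id - 1) // ROWS_PER_SHARD  # 0-based shard index
    if index >= shard_count:
        raise ValueError(f"row {row_id} is beyond the last shard")
    return f"table{index + 1}"
```

The application calls a function like this before every query to pick the table (and host) to talk to. Part of the administration burden comes from the fact that changing the rule later means moving rows between shards.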

MPP databases: then there are databases that are very much RDBMS but do the distribution and scaling for you. Teradata is the most successful of these companies. A significant number of Fortune 500 companies and a lot of the airlines use Teradata, but it's ridiculously expensive. There are newer companies like Greenplum, Vertica, and Netezza; I believe some of these (Greenplum, for one) used Postgres core code at some point.

OTHER TIPS

Unless you're a very big company with extreme scalability requirements, you can scale your DB horizontally while keeping ACID guarantees by building a cluster of identical RDBMS instances and synchronizing them with JTA transactions.

Take a look at this Java/JDBC-based article; it uses the JEPLayer framework, but you can use straight JDBC and JTA code.
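The core idea, applying the same write to every identical instance and committing everywhere or nowhere, can be sketched roughly as follows. This is not JTA and not the article's code: it uses Python's sqlite3 in-memory databases as stand-ins for the RDBMS instances, and make_instance, replicated_write, and the accounts table are hypothetical names chosen for the illustration.

```python
import sqlite3


def make_instance() -> sqlite3.Connection:
    """Create one stand-in RDBMS instance with an identical schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        "id INTEGER PRIMARY KEY, "
        "balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    conn.commit()
    return conn


def replicated_write(instances, sql, params=()):
    """Run the same statement on every instance; commit all or roll back all."""
    try:
        for conn in instances:
            conn.execute(sql, params)  # sqlite3 opens a transaction implicitly
    except sqlite3.Error:
        # Any failure undoes the pending change on every instance.
        for conn in instances:
            conn.rollback()
        return False
    for conn in instances:
        conn.commit()
    return True
```

A real JTA/XA transaction manager adds a prepare phase and a recovery log, so that a crash between two of the commits above cannot leave the instances out of sync. The sketch has no such protection.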

Within the RDBMS paradigm: Sharding.
Outside the RDBMS paradigm: Key-value stores.

My pick (I come from an RDBMS background): key-value stores of the tabular type, such as HBase.

Within the RDBMS paradigm, sharding will not get you far.
Use the RDBMS paradigm to design your model, to get your project up and running.
Use tabular key-value stores to SCALE OUT.

Sharding:

A good way to think about sharding is to see it as user-account-oriented
DB design.

All the schema entities touched by a user account are kept on one host.

The assignment of user to host happens when the user creates an account.
The least loaded host gets that user.

When that user signs on after account creation, he gets connected
to the host that has his data.

Each host has a set of user accounts.
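The assignment and lookup steps above can be sketched as a small in-memory directory. The class and method names are hypothetical, and "least loaded" is assumed here to mean "fewest accounts"; a real system might measure load by storage or traffic instead.

```python
class ShardDirectory:
    """Tracks which host owns which user account (illustrative only)."""

    def __init__(self, hosts):
        # host name -> set of user accounts stored on that host
        self.users_by_host = {host: set() for host in hosts}
        # user account -> host name, filled in at account creation
        self.host_by_user = {}

    def create_account(self, user):
        """Assign a new user to the host with the fewest accounts."""
        if user in self.host_by_user:
            raise ValueError(f"account {user!r} already exists")
        host = min(self.users_by_host, key=lambda h: len(self.users_by_host[h]))
        self.users_by_host[host].add(user)
        self.host_by_user[user] = host
        return host

    def host_for(self, user):
        """On sign-on, return the host that holds this user's data."""
        return self.host_by_user[user]
```

The mapping is fixed at account creation, so every later request for that user goes to the same host without consulting the other hosts.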

The problem with this approach is that if the host gets hosed,
a fraction of users will be blacked out.

The solution to this is to have a replicated standby host that
becomes the primary when the primary host encounters problems.
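A minimal sketch of that promotion step, assuming each shard records one primary and at most one standby. The Shard type, the host_to_connect function, and the is_healthy callback are hypothetical names; how health is actually checked and how the standby is kept in sync are left out.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Shard:
    primary: str
    standby: Optional[str] = None


def host_to_connect(shard: Shard, is_healthy: Callable[[str], bool]) -> str:
    """Return the host to use, promoting the standby if the primary is down."""
    if is_healthy(shard.primary):
        return shard.primary
    if shard.standby is None:
        raise RuntimeError(f"{shard.primary} is down and has no standby")
    # Promote the standby; the shard has no spare until a new one is added.
    shard.primary, shard.standby = shard.standby, None
    return shard.primary
```

After a promotion the shard runs without a spare, so a second failure would black out its users until a new standby is provisioned.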

Also, it's a fairly rigid setup, suited to processes where the design does
not change dramatically.

From the user standpoint, I've noticed that web sites
with a sharded DB backend are not as quick to "turn on a dime"
to create different business models on their platform.

Contrast this with web sites that have truly distributed
key-value stores. These businesses can host any range of
services. Their platform is just that - a platform.
It's not relational and it does have an API,
but it just seems to work.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow