dynamically horizontal scalable key value store

https://stackoverflow.com/questions/2092348

21-09-2019
|

문제

Is there a key value store that will give me the following:

Allow me to simply add and remove nodes and will redstribute the data automatically
Allow me to remove nodes and still have 2 extra data nodes to provide redundancy
Allow me to store text or images up to 1GB in size
Can store small size data up to 100TB of data
Fast (so will allow queries to be performed on top of it)
Make all this transparent to the client
Works on Ubuntu/FreeBSD or Mac
Free or open source

I basically want something I can use a "single", and not have to worry about having memcached, a db, and several storage components so yes, I do want a database "silver bullet" you could say.

Thanks

Zubair

Answers so far: MogileFS on top of BackBlaze - As far as I can see this is just a filesystem, and after some research it only seems to be appropriate for large image files

Tokyo Tyrant - Needs lightcloud. This doesn't auto scale as you add new nodes. I did look into this and it seems it is very fast for queries which fit onto a single node though

Riak - This is one I am looking into myself, but I don't have any results yet

Amazon S3 - Is anyone using this as their sole persistance layer in production? From what I have seen it seems to be used for storage of images as complex queries are too expensive

@shaman suggested Cassandra - definitely one I am looking into

So far it seems that there is no database or key value store that fulfills the criteria I mentioned, not even after offering a bounty of 100 points did the question get answered!

해결책

You are asking too much from open source software.

If you have a couple hundred thousand dollars in your budget for some enterprise class software, there are a couple of solutions. Nothing is going to do what you want out of the box, but there are companies that have products which are close to what you are looking for.

"Fast (so will allow queries to be performed on top of it)"

If you have a key-value store, everything should be very fast. However the problem becomes that without an ontology or data schema built on top of the key-value store, you will end up going through the whole database for each query. You need an index containing the key for each "type" of data you want to store.

In this case, you can usually perform queries in parallel against all ~15,000 machines. The bottleneck is that cheap hard drives cap out at 50 seeks per second. If your data set fits in RAM, your performance will be extremely high. However, if the keys are stored in RAM but there is not enough RAM for the values to be stored, the system will goto disc on almost all key-value lookups. The keys are each located at random positions on the drive.

This limits you to 50 key-value lookups per second per server. Whereas when the key-value pairs are stored in RAM, it is not unusual to get 100k operations per second per server on commodity hardware (ex. Redis).

Serial disc read performance is however extremely high. I have seek drives goto 50 MB/s (800 Mb/s) on serial reads. So if you are storing values on disc, you have to structure the storage so that the values that need to be read from disc can be read serially.

That is the problem. You cannot get good performance on a vanilla key-value store unless you either store the key-value pairs completely in RAM (or keys in RAM with values on SSD drives) or if you define some type of schema or type system on top of the keys and then cluster the data on disc so that all keys of a given type can be retrieved easily through a serial disc read.

If a key has multiple types (for example if you have data-type inheritance relationships in the database), then the key will be an element of multiple index tables. In this case, you will have to make time-space trade offs to structure the values so that they can be read serially from disc. This entails storing redundant copies of the value for the key.

What you want is going to be a bit more advanced than a key-value store, especially if you intend to do queries. The problem of storing large files however is a non-problem. Pretend your system can keys upto 50 meg. Then you just break up a 1 gig file into 50 meg segments and associate a key to each segment value. Using a simple server it is straight forward to translate the part of the file you want into a key-value lookup operation.

The problem of achieving redundancy is more difficult. Its very easy to "fountain code" or "part file" the key-value table for a server, so that the server's data can be reconstructed at wire speed (1 Gb/s) onto a standby server, if a particular server dies. Normally, you can detect server death using a "heart beat" system which is triggered if the server does not respond for 10 seconds. It is even possible to key-value lookups against the part-file encoded key-value tables, but it is inefficient to do so but still gives you a backup for the event of server failure. A bigger issues it is almost impossible to keep the backup up to date and the data may be 3 minutes old. If you are doing lots of writes, the backup functionality is going to introduce some performance overhead, but the overhead will be negligible if your system is primarily doing reads.

I am not an expert on maintaining database consistency and integrity constraints under failure modes, so I am not sure what problems this requirement would introduce. If you do not have to worry about this, it greatly simplifies the design of the system and its requirements.

Fast (so will allow queries to be performed on top of it)

First, forget about joins or any operation that scales faster than n*log(n) when your database is this large. There are two things you can do to replace the functionality normally implemented with joins. You can either structure the data so that you do not need to do joins or you can "pre-compile" the queries you are doing and make a time-space trade-off and pre-compute the joins and store them for lookup in advance.

For semantic web databases, I think we will be seeing people pre-compiling queries and making time-space trade-offs in order to achieve decent performance on even modestly sized datasets. I think that this can be done automatically and transparently by the database back-end, without any effort on the part of the application programmer. However we are only starting to see enterprise databases implementing these techniques for relational databases. No open source product does it as far as I am aware and I would surprised if anyone is trying to do this for linked data in horizontally scalable databases yet.

For these types of systems, if you have extra RAM or storage space the best use of it is to pre-compute and store the result of common sub-queries for performance reasons, instead of adding more redundancy to the key-value store. Pre-compute results and order by the keys you are going to query against to turn an n^2 join into a log(n) lookup. Any query or sub-query that scales worse than n*log(n) is something whose results need to be performed and cached in the key-value store.

If you are doing a large number of writes, the cached sub-queries will be invalidated faster than they can be processed and there is no performance benefit. Dealing with cache invalidation for cached sub-queries is another intractable problem. I think a solution is possible, but I have not seen it.

Welcome to hell. You should not expect to get a system like this for free for another 20 years.

So far it seems that there is no database or key value store that fulfills the criteria I mentioned, not even after offering a bounty of 100 points did the question get answered!

You are asking for a miracle. Wait 20 years until we have open source miracle databases or you should be willing to pay money for a solution customized to your application's needs.

다른 팁

Amazon S3 is a storage solution, not a database.

If you only need simple key/value your best bet would be to use Amazon SimpleDB in combination with S3. Large files are stored on S3, while meta data for searching is stored in SimpleDB. this gives you a horizontally scalable key/value system with direct access to S3.

There's another solution, which seems to be exactly what you are looking for: The Apache Cassandra project: http://incubator.apache.org/cassandra/

At the moment twitter is switching to Cassandra from memcached+mysql cluster

HBase and HDFS together fulfil most of these requirements. HBase can be used to store and retrieve small objects. HDFS can be used to store large objects. HBase compacts small objects and stores them as larger ones on HDFS. Speed is relative - HBase is not as fast on random reads from disk as mysql (for example) - but is pretty fast serving reads from memory (similar to Cassandra). It has excellent write performance. HDFS, the underlying storage layer, is fully resilient to loss of multiple nodes. It replicates across racks as well allowing rack level maintenance. It's a Java based stack with Apache license - runs pretty much most OS.

The main weaknesses of this stack are less than optimal random disk read performance and lack of cross data center support (which is a work in progress).

I can suggest you two possible solutions:

1) Buy Amazon's service (Amazon S3). For 100 TB it will cost you 14 512$ monthly.
2) much cheaper solution:

Build two custom backblaze storage pods (link) and run a MogileFS on top of it.

Currently I'm investigating how to store petabytes of data using similar solutions, so if you find something interesting on that, please post you notes.

Take a look at Tokyo Tyrant. It is a very lightweight, high-performance, replicating daemon exporting a Tokyo Cabinet key-value store to the network. I've heard good things about it.

From what I see in your question Project Voldemort seems to be the closest one. Have a look at their Design page.

The only problem I see is how will it handle huge files, and according to this thread, thing aren't all good. But you can always work around that fairly easily using files. In the end - this is the exact purpose of a file system. Have a look at the wikipedia list of file systems - the list is huge.

You might want to take a look at MongoDB.

From what I can tell you're looking for a database/distrubuted filesystem mix, which might be hard or even impossible to find.

You might want to take a look at distributed filesystems like MooseFS or Gluster and keep your data as files. Both systems are fault-tolerant and distributed (you can put in and take out nodes as you like), and both are transparent to clients (built on top of FUSE) - you're using simple filesystem ops. This covers following features: 1), 2), 3), 4), 6), 7), 8). We're using MooseFS for digital movies storage with something aroung 1,5 PB of storage and upload/download is as fast as network setup allows (so the performance is I/O dependent, not protocol or implementation dependent). You won't have queries (feature 5) on your list), but you can couple such filesystem with something like MongoDB or even some search engine like Lucene (it has clustered indexes) to query data stored in filesystem.

Zubair,

I am working on a key-value store which so far is faster than anything else.

It does not (yet) use replication, missing your 2 first requirements, but this question inspired me - thanks for that!

no: Allow me to simply add and remove nodes and will redstribute the data automatically
no: Allow me to remove nodes and still have 2 extra data nodes to provide redundancy
ok: Allow me to store text or images up to 1GB in size (yes: unlimited)
ok: Can store small size data up to 100TB of data (yes: unlimited)
ok: Fast (so will allow queries to be performed on top of it) (yes: faster than Tokyo Cabinet's TC-FIXED array)
ok: Make all this transparent to the client (yes: integrated to web server)
ok: Works on Ubuntu/FreeBSD or Mac (yes: Linux)
ok: Free or open source (yes: freeware)

Besides single-thread performances superior to hash-tables and B-trees, this KV store is the ONLY ONE I KNOW to be "WAIT-FREE" (not blocking, nor delaying any operation).

MarkLogic is going in this direction. Not at all free, though...

In addition to what others have mentioned - you could take a look at OrientDB - http://code.google.com/p/orient/ a document and K/V store that looks very promising.

Check out BigCouch. It's CouchDB, but optimized for clusters (and all the big data problems clusters are appropriate for). BigCouch is getting merged into the CouchDB project as we speak, by the folks at Cloudant, many of whom are core committers to CouchDB.

Rundown of your requirements:

Allow me to simply add and remove nodes and will redstribute the data automatically

Allow me to remove nodes and still have 2 extra data nodes to provide redundancy

Yes. BigCouch uses Dynamo's concept of Quorum to set how many nodes keep how many copies of your data.

Allow me to store text or images up to 1GB in size

Yes. Just like CouchDB, you can stream blobs (such as files) of arbitrary size to the database.

Can store small size data up to 100TB of data

Yes. The team that built BigCouch did so because they were facing a system generating petabytes of data per second.

Fast (so will allow queries to be performed on top of it)

Yes. Queries are done by MapReduce in O(log n) time.

Make all this transparent to the client

Works on Ubuntu/FreeBSD or Mac

Free or open source

Yup! Open source under the Apache 2.0 license. The default install instructions are for a Debian system, like Ubuntu.

라이센스 : CC-BY-SA ~와 함께 속성

제휴하지 않습니다 StackOverflow