Question

I basically have one gigantic table (about 1,000,000,000,000 records) in a database, with these fields:

id, block_id, record

id is unique; block_id is not unique. There are at most about 10k rows sharing the same block_id, each with a different record.

To simplify my job that deals with the DB I have an API similar to this:

Engine e = new Engine(...);
// this method must be thread safe, but with fine-grained locking (per block_id) to improve concurrency
e.add(block_id, "asdf"); // a record is up to 1 KB max

// this must concatenate all records already added for block_id; the result won't be bigger than 10 MB (worst case), average < 5 MB
String s = e.getConcatenatedRecords(block_id);
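
To make the fine-grained locking concrete, this is roughly what I mean; a sketch only, with appendToBlock as a stand-in for whatever the backing store does:

import java.util.concurrent.ConcurrentHashMap;

public class Engine {
    // one lock object per block_id, so writers to different blocks never contend
    private final ConcurrentHashMap<Long, Object> locks = new ConcurrentHashMap<>();

    public void add(long blockId, String record) {
        Object lock = locks.computeIfAbsent(blockId, k -> new Object());
        synchronized (lock) {
            appendToBlock(blockId, record);
        }
    }

    // stand-in for the actual storage (PostgreSQL, flat file, ...)
    private void appendToBlock(long blockId, String record) {
    }
}

With hundreds of millions of distinct block_ids this map grows without bound; a fixed array of N locks indexed by block_id % N (lock striping) would bound the memory instead.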

If I map each block to a file (I haven't done it yet), then each record will be a line in that file and I will still be able to use the same API.
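
Roughly what I have in mind for the file mapping, as a sketch (one file per block_id under a data directory; all names invented, and add must still run under the per-block lock above):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class FileEngine {
    private final Path dir; // data directory, must already exist

    public FileEngine(Path dir) {
        this.dir = dir;
    }

    // append one record as one line in the block's file
    public void add(long blockId, String record) throws IOException {
        Path file = dir.resolve(Long.toString(blockId));
        Files.write(file, (record + "\n").getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    // one sequential read of at most ~10 MB
    public String getConcatenatedRecords(long blockId) throws IOException {
        Path file = dir.resolve(Long.toString(blockId));
        return Files.exists(file)
                ? new String(Files.readAllBytes(file), StandardCharsets.UTF_8)
                : "";
    }
}

One caveat I already see: with ~1,000,000,000,000 records and at most 10k per block, that is at least 100 million block files, so they would have to be sharded into subdirectories.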

But I want to know: will I get any performance gain by using flat files compared to a well-tuned PostgreSQL database, at least for this specific scenario?

My biggest requirement, though, is that getConcatenatedRecords returns stupidly fast (the add operation is not as critical). I am also considering caching and memory mapping; I just don't want to complicate things before asking whether there is an already-made solution for this kind of scenario.


Solution 3

After some research, I found that a family of embeddable data stores covers most of my use cases.

The interesting part is that they all mostly expose the standard Java collections API (lists, sets, maps...).

EDIT: All these projects let me open a file as a data store for huge collections, reference each collection by name, and keep many collections per file. Each collection is indexed. The idea is that these projects are meant to serve as a foundation for real databases; you can view them as the storage engine of a database (be it SQL or NoSQL). Because these projects are the foundation for projects like MongoDB, H2 Database and OrientDB, I am confident that if the simple data-store approach fits my needs, it will also scale without any problems. And because my partitioning needs are very simple, I can also share the load across other nodes.
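
As a concrete illustration of that collections-style API (using MapDB here purely as an example of such an engine; a sketch, untested):

import java.util.NavigableSet;
import java.util.stream.Collectors;
import org.mapdb.DB;
import org.mapdb.DBMaker;
import org.mapdb.Serializer;
import org.mapdb.serializer.SerializerArrayTuple;

public class MapDbEngine {
    private final DB db = DBMaker.fileDB("blocks.db").fileMmapEnableIfSupported().make();

    // one named, indexed collection inside the file; each element is [block_id, id, record]
    private final NavigableSet<Object[]> records = db.treeSet("records")
            .serializer(new SerializerArrayTuple(Serializer.LONG, Serializer.LONG, Serializer.STRING))
            .createOrOpen();

    public void add(long blockId, long id, String record) {
        records.add(new Object[]{blockId, id, record});
    }

    public String getConcatenatedRecords(long blockId) {
        // subSet selects one block_id prefix; null acts as positive infinity in tuple comparisons
        return records.subSet(new Object[]{blockId}, new Object[]{blockId, null})
                .stream()
                .map(t -> (String) t[2])
                .collect(Collectors.joining("\n"));
    }
}

The point is that the range scan touches only the pages of one block, which is what keeps getConcatenatedRecords fast.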

OTHER TIPS

It sounds like you already have this running in Postgres - can you post the schema you're using? It's certainly possible to do better than a well-tuned database in very specific scenarios, but it usually turns out to be vastly more work than you imagine going in (especially if you're synchronizing writes).

Are you using CLUSTER with your index? What are the storage settings for the table?

And how large can the table get before your queries become too slow?
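
For illustration, the kind of schema and CLUSTER setup I have in mind looks like this (all names invented, run here via JDBC):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class PostgresSetup {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/mydb", "user", "password");
             Statement st = conn.createStatement()) {
            // schema as described in the question: id unique, block_id not
            st.execute("CREATE TABLE records ("
                    + " id       bigserial PRIMARY KEY,"
                    + " block_id bigint NOT NULL,"
                    + " record   text NOT NULL)");
            st.execute("CREATE INDEX records_block_idx ON records (block_id)");
            // physically reorder the table by block_id so one block sits in adjacent pages
            st.execute("CLUSTER records USING records_block_idx");
        }
    }
}

Keep in mind that CLUSTER is a one-shot operation: PostgreSQL does not maintain the ordering for new rows, so it has to be re-run as the data changes.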

Since you seem to be building an object store on top of PostgreSQL, why not use an object store instead?

I'd start with OpenStack Swift, or alternately a distributed network file system if that's closer to your needs. (Ab)using PostgreSQL as a network file system isn't going to get you far if you care about performance. The only time I'd do that would be when I needed ACID semantics, such as atomic commits of some database changes along with a file they relate to.

You don't get atomic commit across multiple PostgreSQL instances (though you get close with prepared transactions), so I'm guessing that's not your use case. If it isn't, I suggest looking for the right tool for the job.
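
To unpack the "prepared transactions" aside: that is PostgreSQL's two-phase commit, where a transaction is first made durable and then committed separately, possibly from a different session. A minimal sketch (connection details and names invented; the server needs max_prepared_transactions > 0):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class TwoPhaseCommitSketch {
    static final String URL = "jdbc:postgresql://localhost/mydb"; // invented

    public static void main(String[] args) throws Exception {
        // phase 1: do the work, then park the transaction on disk instead of committing
        try (Connection conn = DriverManager.getConnection(URL, "user", "password")) {
            conn.setAutoCommit(false);
            try (Statement st = conn.createStatement()) {
                st.execute("INSERT INTO records (block_id, record) VALUES (42, 'asdf')");
                st.execute("PREPARE TRANSACTION 'txn-42'");
            }
        }

        // phase 2: any session can later commit (or roll back) the parked transaction
        try (Connection conn = DriverManager.getConnection(URL, "user", "password");
             Statement st = conn.createStatement()) {
            st.execute("COMMIT PREPARED 'txn-42'");
        }
    }
}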

Licensed under: CC-BY-SA with attribution