Question

I'm working on biology software that generates millions of strings (formed from the nucleotide bases A, G, C, T), usually longer than 30 characters. It is written in C.

I need a database that can store this data on disk fast enough not to become a bottleneck for the whole application, and without consuming too much RAM. Moreover, I need it to be completely linked inside my application; I don't want to force my users to install a SQL server or anything like that.

I have already tried hamsterDB, SQLite, Kyoto Cabinet and MapDB without success. The problem is that I need to insert or update records at ~50k operations/sec at least. With some optimizations, SQLite was the fastest: it reaches 18k operations/sec (using synchronous = OFF, journal_mode = OFF, transactions, ignore_check_constraints = ON, a cache_size of 500,000 and pre-compiled statements).
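
A minimal sketch, in C, of the SQLite tuning described above (the file name and the bare-bones error handling are assumptions, not the original code):

    #include <sqlite3.h>
    #include <stdio.h>

    /* Open the database and apply the pragmas mentioned above.
     * "events.db" is an assumed file name. */
    static sqlite3 *open_tuned_db(void)
    {
        sqlite3 *db = NULL;
        if (sqlite3_open("events.db", &db) != SQLITE_OK) {
            fprintf(stderr, "open failed: %s\n", sqlite3_errmsg(db));
            return NULL;
        }
        /* Trade durability for speed: no fsync, no journal. */
        sqlite3_exec(db, "PRAGMA synchronous = OFF;", NULL, NULL, NULL);
        sqlite3_exec(db, "PRAGMA journal_mode = OFF;", NULL, NULL, NULL);
        sqlite3_exec(db, "PRAGMA ignore_check_constraints = ON;", NULL, NULL, NULL);
        sqlite3_exec(db, "PRAGMA cache_size = 500000;", NULL, NULL, NULL);
        return db;
    }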

Each sequence is classified as type A or type B, and I need to know how many of each kind I have. Right now I use the sequence as the key, with one counter for A types and another for B types. On SQLite databases I am using columns and a statement like this:

INSERT OR REPLACE INTO events (main_seq,qnt_A,qnt_B) VALUES (@SEQ,COALESCE((SELECT qnt_A FROM events WHERE main_seq=@SEQ)+1,1),(SELECT qnt_B FROM events WHERE main_seq=@SEQ))

This is slower than a simple INSERT INTO, but when the sequence already exists in the DB I need to increment just one of the columns.
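
A hedged sketch of how that statement might be driven from C with a pre-compiled statement and batched transactions (the helper name, batching strategy and error handling are assumptions; this is the A-type variant only, a mirror-image statement would increment qnt_B):

    /* Run the upsert above with one pre-compiled statement, batching
     * many sequences into a single transaction. Assumes a db handle
     * opened as in the previous sketch. */
    static const char *UPSERT_SQL =
        "INSERT OR REPLACE INTO events (main_seq, qnt_A, qnt_B) VALUES ("
        " @SEQ,"
        " COALESCE((SELECT qnt_A FROM events WHERE main_seq = @SEQ) + 1, 1),"
        " (SELECT qnt_B FROM events WHERE main_seq = @SEQ))";

    static int count_sequences(sqlite3 *db, const char **seqs, size_t n)
    {
        sqlite3_stmt *stmt = NULL;
        if (sqlite3_prepare_v2(db, UPSERT_SQL, -1, &stmt, NULL) != SQLITE_OK)
            return -1;

        int idx = sqlite3_bind_parameter_index(stmt, "@SEQ");

        sqlite3_exec(db, "BEGIN;", NULL, NULL, NULL);
        for (size_t i = 0; i < n; i++) {
            sqlite3_bind_text(stmt, idx, seqs[i], -1, SQLITE_STATIC);
            sqlite3_step(stmt);          /* executes the upsert          */
            sqlite3_reset(stmt);         /* reuse the compiled statement */
            sqlite3_clear_bindings(stmt);
        }
        sqlite3_exec(db, "COMMIT;", NULL, NULL, NULL);
        sqlite3_finalize(stmt);
        return 0;
    }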

With Kyoto Cabinet I got really high speed, but it only supports string records, and I need to store and update integers to count how many A and B occurrences I have.
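
A generic workaround for string-only stores like Kyoto Cabinet is to pack both counters into a fixed-width binary value, read it, increment one field and write it back as an opaque 8-byte string. A minimal sketch (the struct and helper names are made up for illustration):

    #include <stdint.h>
    #include <string.h>

    /* Pack the two counters into a fixed 8-byte value that any
     * key/value store can hold verbatim. Generic sketch, not
     * Kyoto Cabinet-specific API usage. */
    typedef struct { uint32_t qnt_a; uint32_t qnt_b; } seq_counts;

    static void counts_pack(const seq_counts *c, char out[8])
    {
        memcpy(out,     &c->qnt_a, 4);
        memcpy(out + 4, &c->qnt_b, 4);
    }

    static void counts_unpack(const char in[8], seq_counts *c)
    {
        memcpy(&c->qnt_a, in,     4);
        memcpy(&c->qnt_b, in + 4, 4);
    }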

Does anyone know another good DB that can satisfy my needs for write speed and record flexibility?


Solution

This BerkeleyDB whitepaper says that the theoretical limit is 70,000 transactions per second. Actual performance will be much less, and their theoretical limit is based on some assumptions that won't hold in your case. But they still claim that BerkeleyDB is substantially faster than SQLite.

If a single BDB writer achieves a throughput of about 700 TPS, then the theoretical limit of 70,000 TPS assumes 100 non-conflicting, concurrently executing threads.
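
For what it's worth, a minimal sketch of opening an embedded Berkeley DB B-tree from C and writing one record with the classic DB->put API (the file name, flags and packed-counter value are assumptions):

    #include <db.h>      /* Berkeley DB C API */
    #include <string.h>

    static int bdb_demo(void)
    {
        DB *dbp = NULL;
        if (db_create(&dbp, NULL, 0) != 0)
            return -1;
        if (dbp->open(dbp, NULL, "events.bdb", NULL,
                      DB_BTREE, DB_CREATE, 0664) != 0) {
            dbp->close(dbp, 0);
            return -1;
        }

        const char *seq = "ACGTACGTACGTACGTACGTACGTACGTACGT";
        char counts[8] = {0};                 /* e.g. packed qnt_A / qnt_B */

        DBT key, data;
        memset(&key, 0, sizeof(key));
        memset(&data, 0, sizeof(data));
        key.data  = (void *)seq;
        key.size  = (u_int32_t)strlen(seq);
        data.data = counts;
        data.size = sizeof(counts);

        dbp->put(dbp, NULL, &key, &data, 0);  /* insert or overwrite */
        dbp->close(dbp, 0);
        return 0;
    }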

OTHER TIPS

The following benchmarks find OpenLDAP MDB (LMDB) to suit the case submitted, in particular for large random writes (a usage sketch follows the numbers):

MDB: 13,215 entries/sec
Kyoto TreeDB: 5,860 entries/sec
LevelDB: 3,138 entries/sec
SQLite3: 2,068 entries/sec
BerkeleyDB: 1,952 entries/sec
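
If MDB (LMDB) is worth trying, here is a hedged sketch of the increment-or-insert pattern against its C API; the environment and database handle are assumed to be opened elsewhere (mdb_env_create / mdb_env_open / mdb_dbi_open), and the packed two-counter value layout is an assumption:

    #include <lmdb.h>
    #include <stdint.h>
    #include <string.h>

    /* Increment one of the two counters stored under a sequence key,
     * inserting the record if it does not exist yet. */
    static int mdb_count(MDB_env *env, MDB_dbi dbi,
                         const char *seq, int is_type_a)
    {
        MDB_txn *txn = NULL;
        MDB_val key, val;
        uint32_t counts[2] = {0, 0};          /* [0] = qnt_A, [1] = qnt_B */

        if (mdb_txn_begin(env, NULL, 0, &txn) != 0)
            return -1;

        key.mv_data = (void *)seq;
        key.mv_size = strlen(seq);

        /* Read the current counters if the sequence is already present. */
        if (mdb_get(txn, dbi, &key, &val) == 0 && val.mv_size == sizeof(counts))
            memcpy(counts, val.mv_data, sizeof(counts));

        counts[is_type_a ? 0 : 1] += 1;

        val.mv_data = counts;
        val.mv_size = sizeof(counts);
        if (mdb_put(txn, dbi, &key, &val, 0) != 0) {
            mdb_txn_abort(txn);
            return -1;
        }
        return mdb_txn_commit(txn) == 0 ? 0 : -1;
    }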

Licensed under: CC-BY-SA with attribution