Question

I am working on a Perl project that involves building a hash with about 17 million keys. This is too big to be stored in memory (my laptop's memory will only hold about 10 million keys). I know that the solution is to store the data on disk, but I'm having trouble executing this in practice. Here's what I've tried:

DB_File

use strict;
use DB_File;
my $libfile = shift;
my %library;
tie %library, "DB_File", "$libfile";
for (my $a = 1; $a < 17000000; $a++) {
    # Some code to generate key and value #
    $library{$key} = $value;
}

This gives me a segmentation fault: 11 part way through the loop, for reasons I don't understand.

BerkeleyDB

use strict; 
use BerkeleyDB;
my $libfile = shift;
my $library = new BerkeleyDB::Hash
    -Filename => $libfile,
    -Flags => DB_CREATE;

for (my $a = 1; $a < 17000000; $a++) {
    # Some code to generate key and value #
    $library->db_put($key, $value);
}

This seems to work well for about the first 15 million keys, but then slows down dramatically and finally freezes completely near the end of the loop. I don't think this is a memory issue; if I break the loop into four pieces, put them in four separate programs, and run them sequentially (adding ~4 million records to the database each time), the first three complete successfully, but the fourth one hangs when the database has about 15 million keys. So it seems like maybe BerkeleyDB can only handle ~15 million keys in a hash???

DBM::Deep

use strict; 
use DBM::Deep;
my $libfile = shift;
my $library = new DBM::Deep $libfile;

for (my $a = 1; $a < 17000000; $a++) {
    # Some code to generate key and value #
    $library->put($key => $value);
}

From preliminary tests this seems to work ok, but it's REALLY slow: about 5 seconds per thousand keys, or ~22 hours to run the whole loop. I'd prefer to avoid this if at all possible.

I'd be very grateful for suggestions on troubleshooting one of these packages, or ideas about other options for accomplishing the same thing.

Solution

Switching to a btree may improve performance for a HUGE BerkeleyDB that is written in "key sorted mode", i.e. with keys appended in sorted order. It reduces the number of disk I/O operations required.

Case study: in one case reported on news:comp.mail.sendmail, the creation time for a HUGE BerkeleyDB dropped from a few hours with a hash to 20 minutes with a btree and "key sorted" appends. That was still too long, so the person switched to software that could query the SQL database directly, avoiding the need to "dump" the SQL database into BerkeleyDB at all (virtusertable, sendmail -> postfix).
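
For reference, a minimal sketch of what the btree variant might look like with the BerkeleyDB module; the key/value generation here is a placeholder, not something from the original post:

use strict;
use warnings;
use BerkeleyDB;

my $libfile = shift;

# Same structure as the BerkeleyDB::Hash attempt, but with the btree
# access method: a btree keeps keys in sorted order on disk, so
# appending keys in sorted order touches far fewer pages.
my $library = BerkeleyDB::Btree->new(
    -Filename => $libfile,
    -Flags    => DB_CREATE,
) or die "Cannot open $libfile: $BerkeleyDB::Error";

for my $i (1 .. 17_000_000) {
    # Some code to generate key and value #
    my ($key, $value) = ("key$i", "value$i");    # placeholder
    $library->db_put($key, $value);
}

The benefit depends on the keys actually being inserted in (roughly) sorted order; with a random insertion order a btree behaves much like the hash.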

OTHER TIPS

You can try PostgreSQL.

First create a table with two columns, key and value; varchar will be fine.

Then, instead of inserting rows one at a time, use Pg::BulkCopy to copy the data into the database.

I recommend copying no more than about 100 MB at a time, because if a COPY command fails, PostgreSQL keeps the rows that had already been written to disk, and that space is only reclaimed when you VACUUM FULL the table. (I once processed a lot of 5 GB batches; a couple of them failed on a constraint near the end, and the disk space was never recovered by the rollback.)

PS: you can also use DBD::Pg's COPY support directly: https://metacpan.org/pod/DBD::Pg#COPY-support
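
As a rough sketch of the DBD::Pg COPY route: the connection settings, the library table name, and the key/value generation below are assumptions for illustration, and it assumes keys and values contain no tabs, newlines, or backslashes (those would need escaping):

use strict;
use warnings;
use DBI;

# Placeholder connection settings; adjust for your database.
my $dbh = DBI->connect("dbi:Pg:dbname=mydb", "user", "password",
                       { RaiseError => 1, AutoCommit => 1 });

$dbh->do("CREATE TABLE IF NOT EXISTS library (key varchar, value varchar)");

# Stream the rows through a single COPY instead of millions of INSERTs.
$dbh->do("COPY library (key, value) FROM STDIN");
for my $i (1 .. 17_000_000) {
    # Some code to generate key and value #
    my ($key, $value) = ("key$i", "value$i");    # placeholder
    $dbh->pg_putcopydata("$key\t$value\n");
}
$dbh->pg_putcopyend();

To follow the ~100 MB advice above, you could call pg_putcopyend() and start a new COPY every few million rows.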

After all the copies finish, you can create an index on key, and if you need more speed, put Redis or memcached with a MAXMEMORY policy in front of it as a cache.
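
Continuing the hypothetical library table from the sketch above, the index step is a one-liner:

# Building the index after the bulk load is much faster than
# maintaining it row by row during the COPY.
$dbh->do("CREATE INDEX library_key_idx ON library (key)");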

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow