Question

I'm trying to tackle a problem by using a SQLite database and Perl modules. In the end, there will be tens of millions of entries I need to log. The only unique identifier for each item is a text string for the URL. I'm thinking of doing this in two ways:

Way #1: Have a Good table, a Bad table, and an Unsorted table. (I need to check the HTML and decide whether I want the page.) Say we have 1 billion pages total, so roughly 333 million URLs in each table. When I get a new URL, I need to check whether it's in any of the tables, and add it to Unsorted if it's unique. Also, I would be moving a lot of rows around with this option.

Way #2: I have two tables, Master and Good. Master has all 1 billion page URLs, and Good has the 333 million that I want. For a new URL, I need to do the same thing, except this time I'm only querying one table, and I would never delete a row from Master, only add the URL to Good.

So basically, I need to know the best setup for quickly querying a huge SQLite database to see whether a text string of ~20 characters is unique, and then adding it if it is.
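
A sketch of what I mean for Way #2, using DBD::SQLite (table names are just placeholders); the PRIMARY KEY on the URL column would be the index used for the uniqueness check:

use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=urls.db', '', '', { RaiseError => 1 });
$dbh->do('CREATE TABLE IF NOT EXISTS master (url TEXT PRIMARY KEY)');
$dbh->do('CREATE TABLE IF NOT EXISTS good   (url TEXT PRIMARY KEY)');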

Edit: I'm now trying to get Berkeley DB to work using the Perl module, but no dice. Here's what I have:

use BerkeleyDB;

$dbFolder = 'C:\somedirectory';
my $env = BerkeleyDB::Env->new ( -Home => $dbFolder );

my $db  = BerkeleyDB::Hash->new (
    -Filename => "fred.db",
    -Env      => $env );
my $status = $db->db_put("apple", "red");

And when I run this, I get the following:

Can't call method "db_put" on an undefined value at C:\Directory\perlfile.pl line 42, <STDIN> line 1.

Solution

If $db is undefined, opening the database is failing, and you should inspect $! and $BerkeleyDB::Error to see why.

Have you created the database already? If not, you need -Flags => DB_CREATE.

Working example:

use strict;
use warnings;
use BerkeleyDB;

my $dbFolder = '/home/ysth/bdbtmp/';

my $db  = BerkeleyDB::Hash->new (
    -Filename => "$dbFolder/fred.db", 
    -Flags => DB_CREATE,     # create fred.db if it doesn't already exist
) or die "couldn't create: $!, $BerkeleyDB::Error.\n";

my $status = $db->db_put("apple", "red");

I couldn't get BerkeleyDB::Env to do anything useful, though; whatever I tried, the constructor returned undef.
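
If you do want the environment, one guess at what went wrong (untested here): the env has to be told to create its own files and initialize a memory pool, and the -Home directory must already exist. A sketch:

use strict;
use warnings;
use BerkeleyDB;

my $dbFolder = '/home/ysth/bdbtmp';   # must already exist

my $env = BerkeleyDB::Env->new(
    -Home  => $dbFolder,
    -Flags => DB_CREATE | DB_INIT_MPOOL,
) or die "couldn't create env: $! $BerkeleyDB::Error\n";

my $db = BerkeleyDB::Hash->new(
    -Filename => 'fred.db',
    -Flags    => DB_CREATE,
    -Env      => $env,
) or die "couldn't create db: $! $BerkeleyDB::Error\n";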

OTHER TIPS

I'd be inclined to use a hash instead of SQLite to do what you want to do. A hash is optimized to test for existence without the need to keep the values in any sorted order and with no need to keep a redundant copy of the datum in an index. The hash algorithm applied to the datum yields the location where it would be stored, if it did exist; you can seek to that location and see if it's there. I don't think you'd need to keep the hash table in RAM.
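
An on-disk hash like Berkeley DB works exactly this way; a minimal sketch using the BerkeleyDB module's tie interface (the file name is made up):

use BerkeleyDB;

tie my %seen, 'BerkeleyDB::Hash',
    -Filename => 'seen_urls.db',
    -Flags    => DB_CREATE
    or die "couldn't tie: $! $BerkeleyDB::Error\n";

my $url = 'http://example.com/';
print "new URL\n" unless exists $seen{$url};
$seen{$url} = 1;    # record that we've seen it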

Here's how you might take a hybrid hash/SQLite approach.

Create a SQLite table along these lines (BUCKET is indexed; URL is not):

CREATE TABLE STORE (
    id     INTEGER PRIMARY KEY,
    BUCKET INTEGER NOT NULL,
    URL    TEXT NOT NULL,
    status TEXT
);
CREATE INDEX store_bucket ON STORE (BUCKET);

You could have three of these tables, STORE1, STORE2, and STORE3 if you want to keep them separate by status.

Let's assume that there will be 250,000,001 distinct buckets in each store. (You can experiment with this number; make it a prime number).

Find a hashing algorithm that takes two inputs, the URL string and 250,000,001, and returns a number between 1 and 250,000,001.
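
One way to build such a function in Perl, hashing with the core Digest::MD5 module and reducing modulo the bucket count (the function name here is made up):

use Digest::MD5 qw(md5_hex);

# map a URL to a bucket number in the range 1 .. $nbuckets
sub url_bucket {
    my ($url, $nbuckets) = @_;
    my $n = hex(substr(md5_hex($url), 0, 8));   # first 32 bits of the MD5
    return ($n % $nbuckets) + 1;
}

my $bucket = url_bucket('http://example.com/', 250_000_001);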

When you get a URL, feed it to the hashing algorithm and it will tell you which BUCKET to look in:

SELECT * FROM STORE WHERE BUCKET = {the value returned by your hash function};

Your index on the BUCKET field will quickly return the rows, and you can examine the URLs. If the current URL is not one of them, add it:

INSERT INTO STORE (BUCKET, URL) VALUES ({your hash return value}, {the URL});

SQLite will be indexing integer values, which I think will be more efficient than indexing the URL. And the URL will be stored only once.
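
Putting the lookup and the insert together, a rough DBI sketch (assuming DBD::SQLite, the STORE table above, and a bucket number already computed by your hash function; the helper name is made up):

use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=store.db', '', '',
                       { RaiseError => 1, AutoCommit => 1 });

# returns 1 if the URL was new (and inserted), 0 if it was already in its bucket
sub add_if_new {
    my ($dbh, $bucket, $url) = @_;
    my $urls = $dbh->selectcol_arrayref(
        'SELECT URL FROM STORE WHERE BUCKET = ?', undef, $bucket);
    return 0 if grep { $_ eq $url } @$urls;
    $dbh->do('INSERT INTO STORE (BUCKET, URL) VALUES (?, ?)',
             undef, $bucket, $url);
    return 1;
}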

I don't know if this is optimal, but you could set up your SQLite DB such that the "good" table has a unique constraint on the URL column. You probably don't have enough RAM to do the comparisons in Perl (naive solution would be to create a hash where the URLs are the keys, but if you have a billion pages you'll need an awful lot of memory).

When it comes time to do an insert, the database will enforce uniqueness and throw some kind of error when you try to insert a duplicate URL. You can catch that error and ignore it, as long as DBI gives you a distinguishable error value for the constraint violation.
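
A rough sketch of that setup with DBD::SQLite (the table name is illustrative); SQLite's INSERT OR IGNORE lets you skip catching the error altogether:

use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=urls.db', '', '',
                       { RaiseError => 1, AutoCommit => 1 });

# the PRIMARY KEY on url enforces uniqueness
$dbh->do('CREATE TABLE IF NOT EXISTS good (url TEXT PRIMARY KEY)');

my $url = 'http://example.com/';

# a duplicate is silently skipped; do() reports 0 rows affected (as "0E0") in that case
my $added = $dbh->do('INSERT OR IGNORE INTO good (url) VALUES (?)', undef, $url);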

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow