Question

I deal a lot with text files, comparing one to another in a "SQL manner".

DBD::CSV is an obvious choice to start with, as it lets me use SQL syntax on text tables. However, the text files I deal with are huge, which makes DBD::CSV unusable in terms of performance.
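For context, here is a minimal sketch of the DBD::CSV approach (the file name data.csv and the column num1 are placeholders for illustration):

use strict;
use warnings;
use DBI;

# Treat CSV files in the current directory as SQL tables via DBD::CSV.
# "data.csv" and its column "num1" are hypothetical.
my $dbh = DBI->connect("dbi:CSV:", undef, undef, {
    f_dir      => ".",        # directory that holds the CSV files
    f_ext      => ".csv/r",   # table "data" maps to file "data.csv"
    RaiseError => 1,
}) or die $DBI::errstr;

my $sth = $dbh->prepare("SELECT * FROM data WHERE num1 > 10");
$sth->execute;
while (my $row = $sth->fetchrow_arrayref) {
    print join("\t", @$row), "\n";
}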

So I started writing a module that converts a CSV file into a SQLite DB and then returns a DBI database handle (via DBD::SQLite) that I can play with. The problem is that converting a text file into a SQLite table can also be inefficient, because I cannot run the sqlite3 command line from Perl to load CSV files quickly (using .import). So I have to build one huge INSERT INTO string from the text tables and execute it (executing INSERTs line by line is very inefficient, which is why I preferred one big INSERT). I would like to avoid that, and I am looking for a one-liner to load a CSV into SQLite from Perl.
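For reference, a rough sketch of that "one big INSERT" workaround (the table name test, the file test.csv, and a pre-created schema are assumptions for illustration):

use strict;
use warnings;
use DBI;
use Text::CSV_XS;

my $dbh = DBI->connect("dbi:SQLite:dbname=csvtest.sqlite", "", "", { RaiseError => 1 });
my $csv = Text::CSV_XS->new({ binary => 1 }) or die Text::CSV_XS->error_diag;

# Quote every value and collect one "(v1,v2,...)" group per CSV row
open my $fh, "<", "test.csv" or die $!;
my @values;
while (my $row = $csv->getline($fh)) {
    push @values, "(" . join(",", map { $dbh->quote($_) } @$row) . ")";
}
close $fh;

# One multi-row INSERT; very large statements can still hit SQLite limits,
# which is why the answer below uses bind parameters in a transaction instead.
$dbh->do("INSERT INTO test VALUES " . join(",", @values)) if @values;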

Another thing: I use the following functions to execute a SQL query and print the result nicely:

sub sql_command {
    my ($self, $str) = @_;

    # DBI reports errors via errstr, not $!
    my $s = $self->{_db}->prepare($str) or die $self->{_db}->errstr;
    $s->execute() or die $s->errstr;

    # Collect the column names and all rows, replacing NULLs with "undef"
    my $table;
    push @$table, [ map { defined $_ ? $_ : "undef" } @{ $s->{NAME} } ];
    while (my $row = $s->fetch) {
        push @$table, [ map { defined $_ ? $_ : "undef" } @$row ];
    }

    return box_format($table);
}


sub box_format {
    my $table  = shift;
    my $n_cols = scalar @{ $table->[0] };

    # Column separators; rule() below draws the horizontal lines
    my $tb = Text::Table->new( \'| ', '', ( \' | ', '' ) x ( $n_cols - 1 ), \' |+' );
    $tb->load(@$table);
    my $rule = $tb->rule(qw/- +/);
    my @rows = $tb->body();
    return $rule, shift @rows, $rule, @rows, $rule
        if @rows;
}

The sql_command sub takes about a minute to execute (on a 6.5 MB file), which in my opinion is way longer than it should be. Does anyone have a more efficient solution?

Thanks!

Solution

Text::CSV_XS is extremely fast; using it to handle the CSV parsing should take care of that side of the performance problem.

There should be no need for special bulk insert code to make DBD::SQLite performant. An insert statement with bind parameters is very fast. The main trick is to turn off AutoCommit in DBI and do all the inserts in a single transaction.

use v5.10;
use strict;
use warnings;
use autodie;

use Text::CSV_XS;
use DBI;

my $dbh = DBI->connect(
    "dbi:SQLite:dbname=csvtest.sqlite", "", "",
    {
        RaiseError => 1, AutoCommit => 0
    }
);

$dbh->do("DROP TABLE IF EXISTS test");

$dbh->do(<<'SQL');
CREATE TABLE test (
    name        VARCHAR,
    num1        INT,
    num2        INT,
    thing       VARCHAR,
    num3        INT,
    stuff       VARCHAR
)
SQL

# Using bind parameters avoids having to recompile the statement every time
my $sth = $dbh->prepare(<<'SQL');
INSERT INTO test
       (name, num1, num2, thing, num3, stuff)
VALUES (?,    ?,    ?,    ?,     ?,    ?    )
SQL

my $csv = Text::CSV_XS->new or die;
open my $fh, "<", "test.csv";
while(my $row = $csv->getline($fh)) {
    $sth->execute(@$row);
}
$csv->eof;
close $fh;

$sth->finish;    
$dbh->commit;

This ran through a 5.7 MB CSV file in 1.5 seconds on my MacBook. The file was filled with 70,000 lines of...

"foo",23,42,"waelkadjflkajdlfj aldkfjal dfjl",99,"wakljdlakfjl adfkjlakdjflakjdlfkj"

It might be possible to make it a little faster using bind columns, but in my testing it slowed things down.
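
For completeness, the bind-columns variant mentioned above would look roughly like this, as a drop-in replacement for the read loop in the script above (same $csv, $fh, and $sth):

# Bind one scalar per column; getline() then fills them in place
# instead of returning a fresh array ref for every row.
my ($name, $num1, $num2, $thing, $num3, $stuff);
$csv->bind_columns(\($name, $num1, $num2, $thing, $num3, $stuff));
while ($csv->getline($fh)) {
    $sth->execute($name, $num1, $num2, $thing, $num3, $stuff);
}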
