Question

I have a huge tab-separated file with up to 200 million rows (normally around 20 million) and two columns: the first column contains an ASCII word of up to 40 characters, and the second contains an integer.

I would like to do the following steps:

  1. sort by first column
  2. delete duplicate rows to make all rows unique
  3. read out all rows for a given entry in the first column

I have a memory limit of 3 GB (so reading all the data into a hash won't work), unlimited hard disk space, and I want to run the script on a single core. I intend to run several scripts in parallel, so the read and write operations on the hard disk shouldn't be too heavy.

How should I proceed with the implementation of my script (in Perl), considering the size of the file?

Which algorithm do you recommend for the first step considering the size of the file?

Step 3 is the most complex part, I think. How should I handle it? I am not familiar with indexing algorithms. Could you suggest one that is best suited to the problem? Are there any Perl modules that I could use?

Does it make sense to first convert the file into a binary file (like converting SAM to BAM)? If yes, have you got any instructions or algorithms for converting and handling such files?

Solution

Reading the entire file into a SQLite database would be my first attempt.

Define the table like:

create table mytuples (
    mykey varchar(40),
    myval integer,
    constraint tuple_pk primary key(mykey, myval) on conflict ignore
);

A simple script using DBI which ignores insert errors should do it.

Untested, and error checks omitted

#!/usr/bin/env perl

use strict; use warnings;
use autodie;

use DBI;

my ($infile) = (@ARGV);

open my $in, '<', $infile;

# AutoCommit is off so all inserts run inside one transaction; RaiseError is
# off so a failed insert does not abort the run. Duplicate rows are dropped by
# the ON CONFLICT IGNORE clause on the primary key in the table definition.
my $dbh = DBI->connect('dbi:SQLite:some.db', undef, undef, {
        AutoCommit => 0,
        RaiseError => 0,
    },
);

while (my $line = <$in>) {
    # split ' ' splits on any whitespace (here, the tab) and drops the newline
    my ($key, $val) = split ' ', $line;
    $dbh->do(q{INSERT INTO mytuples VALUES(?, ?)}, undef, $key, $val);
}

$dbh->commit;
$dbh->disconnect;

This may end up slower than sort and grep on the command line for the initial processing, but you may appreciate the flexibility of having SQL at your disposal.
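For step 3, for example, reading out all rows for a given key reduces to a single SELECT. An untested sketch along the same lines (it assumes the some.db file and the mytuples table created by the loader above):

#!/usr/bin/env perl

use strict; use warnings;

use DBI;

my ($key) = @ARGV;

my $dbh = DBI->connect('dbi:SQLite:some.db', undef, undef, {
        RaiseError => 1,
    },
);

# The composite primary key (mykey, myval) also serves as an index on mykey,
# so this lookup does not have to scan the whole table.
my $vals = $dbh->selectcol_arrayref(
    q{SELECT myval FROM mytuples WHERE mykey = ?},
    undef, $key,
);

print "$key\t$_\n" for @$vals;

$dbh->disconnect;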

Other tips

Use the system sort to sort the file. The latest GNU sort has a parallel option. Run uniq, and then reading the sorted file one line at a time and noticing when the first column changes is easy. sort uses a sort/merge algorithm that splits the file into smaller chunks, sorts them, and merges them, so memory is not an issue (except for speed) as long as you have plenty of disk.
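A minimal, untested sketch of that last part, assuming the file has already been sorted and deduplicated on disk (for example with something like sort infile | uniq > sorted.tsv, letting GNU sort do the chunked sort/merge):

#!/usr/bin/env perl

use strict; use warnings;
use autodie;

my ($sorted_file) = @ARGV;

open my $in, '<', $sorted_file;

my $current_key = '';
my @values;

while (my $line = <$in>) {
    chomp $line;
    my ($key, $val) = split /\t/, $line;

    # The file is sorted, so a change in the first column closes a group.
    if ($key ne $current_key) {
        print "$current_key: @values\n" if @values;
        $current_key = $key;
        @values      = ();
    }
    push @values, $val;
}

print "$current_key: @values\n" if @values;    # flush the last group

close $in;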

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow