Question

I have a large dataset that looks like this:

identifier,feature 1, feature 2, feature 3, ...
29239999, 2,5,3,...
29239999, 2,4,3,...
29239999, 2,6,7,...
17221882, 2,6,7,...
17221882, 1,1,7,...

I would like to write a script that groups these lines by identifier (so the first three lines and the last two would each form a group) in order to compare them. For example, from the three 29239999 lines I would take one of the two that have feature 3 equal to 3, and the last one, which has feature 3 equal to 7. In particular, I would like to take the line with the largest feature 2 (for 29239999, that is the third line).

My specific question: of my two options, (1) hashes or (2) making each identifier an object and then comparing the objects, which is better?


Solution

If you really are working with a "large" data set and the data is already grouped by id as in your example, then I suggest that you process each group as you read it instead of building a huge hash. That way only one group is held in memory at a time.

use strict;
use warnings;

# Skip Header row
<DATA>;

my @group;
my $lastid = '';

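while (<DATA>) {
    next unless /\S/;    # Skip blank lines
    chomp;
    my ($id, $data) = split /,\s*/, $_, 2;

    # A change of id means the previous group is complete
    if ($id ne $lastid) {
        processData($lastid, @group);
        @group = ();
    }

    push @group, $data;
    $lastid = $id;
}

# Process the final group
processData($lastid, @group);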

sub processData {
    my $id = shift;

    return if ! @_;

    print "$id " . scalar(@_) . "\n";

    # Rest of code here
}

__DATA__
identifier,feature 1, feature 2, feature 3, ...
29239999, 2,5,3,...
29239999, 2,4,3,...
29239999, 2,6,7,...
17221882, 2,6,7,...
17221882, 1,1,7,...

Outputs

29239999 3
17221882 2
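
The "Rest of code here" placeholder could be filled in along the following lines to pick, for each identifier, the row with the largest feature 2. This is only a sketch: the helper name best_row is hypothetical, and it assumes that feature 2 is the second value after the identifier, that it is numeric, and that ties go to the first row seen.

```perl
use strict;
use warnings;

# Hypothetical helper: return the row of a group with the largest feature 2.
# Each row is the text after the identifier, e.g. "2,5,3,...".
sub best_row {
    my @rows = @_;
    my ($best, $best_f2);

    for my $row (@rows) {
        # Feature 2 is the second comma-separated value of the row.
        my (undef, $f2) = split /,\s*/, $row;
        next unless defined $f2;

        # Strict ">" keeps the first row when two rows tie.
        if (!defined $best_f2 || $f2 > $best_f2) {
            ($best, $best_f2) = ($row, $f2);
        }
    }
    return $best;
}

# The three rows of the 29239999 group from the question.
print best_row('2,5,3,...', '2,4,3,...', '2,6,7,...'), "\n";
```

Inside processData this could be called as my $best = best_row(@_); after the id has been shifted off, so that still only one group is in memory at any time.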

OTHER TIPS

It sounds like you want to print only the first occurrence of any pair of ID and Feature 3. Is that right?

This program will do that for you. It expects the path to the input file as a parameter on the command line, and sends the revised data to STDOUT.

use strict;
use warnings;

my %seen;

print scalar <>; # Copy header line

while (<>) {
  next unless /\S/;
  my @fields = split /,/, $_, 5;
  my $key = join '/', @fields[0,3];
  print unless $seen{$key}++;
}

Output

identifier,feature 1, feature 2, feature 3, ...
29239999, 2,5,3,...
29239999, 2,6,7,...
17221882, 2,6,7,...
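
If the intended rule is instead to keep, for each pair of ID and Feature 3, the row with the largest Feature 2 rather than the first one seen, the same hash idea can be adapted. The following is a sketch under that assumption: the name best_per_key is hypothetical, Feature 2 is assumed to be numeric, and the rows are passed in as a list rather than read from a file to keep the example self-contained.

```perl
use strict;
use warnings;

# Hypothetical helper: from a list of CSV lines (header already removed),
# keep for each identifier/feature 3 pair the line with the largest feature 2.
sub best_per_key {
    my @lines = @_;
    my (%best, @order);

    for my $line (@lines) {
        next unless $line =~ /\S/;    # skip blank lines
        my ($id, undef, $f2, $f3) = split /\s*,\s*/, $line;
        my $key = "$id/$f3";

        # Remember the order in which keys first appear.
        push @order, $key unless exists $best{$key};

        # Replace the stored line only when feature 2 is strictly larger.
        if (!exists $best{$key} || $f2 > $best{$key}{f2}) {
            $best{$key} = { f2 => $f2, line => $line };
        }
    }
    return map { $best{$_}{line} } @order;
}

my @rows = (
    "29239999, 2,5,3,...",
    "29239999, 2,4,3,...",
    "29239999, 2,6,7,...",
    "17221882, 2,6,7,...",
    "17221882, 1,1,7,...",
);
print "$_\n" for best_per_key(@rows);
```

The hash holds one line per key, so memory grows with the number of distinct ID and Feature 3 pairs rather than with the number of rows, and the input does not need to be grouped by identifier.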
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow