Compare lines in a file

Question 1

If you really are working with a "large" data set and the data is already grouped by id, as in your example, then I suggest processing each group as you go instead of building a huge hash.

use strict;
use warnings;

# Skip Header row
<DATA>;

my @group;
my $lastid = '';

while (<DATA>) {
    my ($id, $data) = split /,\s*/, $_, 2;

    # A change of id means the previous group is complete
    if ($id ne $lastid) {
        processData($lastid, @group);
        @group = ();
    }

    push @group, $data;
    $lastid = $id;
}

# Process the final group
processData($lastid, @group);

sub processData {
    my $id = shift;

    # Nothing to do for the empty group seen before the first id
    return if ! @_;

    print "$id " . scalar(@_) . "\n";

    # Rest of code here
}

__DATA__
identifier,feature 1, feature 2, feature 3, ...
29239999, 2,5,3,...
29239999, 2,4,3,...
29239999, 2,6,7,...
17221882, 2,6,7,...
17221882, 1,1,7,...

Outputs

29239999 3
17221882 2
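
The approach above only works if every id occupies a single contiguous run of lines. If you are not sure that holds for your file, a cheap check is possible that remembers only the ids, not the rows. This is a minimal sketch; the helper name ungrouped_ids is hypothetical, and it assumes you have already extracted the id column in file order.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the ids that appear in more than one
# separate run, i.e. the ids that break the "already grouped" assumption.
sub ungrouped_ids {
    my @ids = @_;
    my (%closed, %bad);
    my $lastid;

    for my $id (@ids) {
        if (defined $lastid && $id ne $lastid) {
            $closed{$lastid} = 1;    # the run for $lastid has ended
        }
        $bad{$id} = 1 if $closed{$id};    # id came back after its run ended
        $lastid = $id;
    }

    return sort keys %bad;
}

# The sample ids are grouped, so nothing is reported here
my @bad = ungrouped_ids(qw(29239999 29239999 17221882 17221882));
print "Ungrouped ids: @bad\n" if @bad;
```

If the list comes back non-empty, the streaming version would call processData more than once for the same id, and a hash keyed by id would be the safer choice.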

Question 2

It sounds like you want to print only the first occurrence of any pair of ID and Feature 3. Is that right?

This program will do that for you. It expects the path to the input file as a parameter on the command line, and sends the revised data to STDOUT.

use strict;
use warnings;

my %seen;

print scalar <>; # Copy header line

while (<>) {
  next unless /\S/;
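  # Split off the first four fields; the rest of the line stays in the fifth
  my @fields = split /,/, $_, 5;
  # The key combines the identifier (index 0) and feature 3 (index 3)
  my $key = join '/', @fields[0,3];
  # Print a line only the first time its key is seen
  print unless $seen{$key}++;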
}

Outputs

identifier,feature 1, feature 2, feature 3, ...
29239999, 2,5,3,...
29239999, 2,6,7,...
17221882, 2,6,7,...
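
If the file is large and grouped by id, as Question 1 assumes, the two ideas can be combined: clear %seen whenever the id changes, so the hash never holds more than one group's keys. This is only a sketch under that grouping assumption; the helper name dedupe_grouped is hypothetical, and it takes the data lines (without the header) as a list to keep the example self-contained.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the lines to keep, in their original order.
# Assumes the lines are already grouped by identifier.
sub dedupe_grouped {
    my @lines = @_;
    my (%seen, @kept);
    my $lastid = '';

    for my $line (@lines) {
        next unless $line =~ /\S/;

        my @fields = split /,\s*/, $line, 5;
        next if @fields < 4;              # skip lines with no feature 3
        my ($id, $feature3) = @fields[0, 3];

        # A new id means earlier keys cannot recur, so forget them
        if ($id ne $lastid) {
            %seen   = ();
            $lastid = $id;
        }

        push @kept, $line unless $seen{$feature3}++;
    }

    return @kept;
}

my @rows = (
    "29239999, 2,5,3,...\n",
    "29239999, 2,4,3,...\n",
    "29239999, 2,6,7,...\n",
    "17221882, 2,6,7,...\n",
    "17221882, 1,1,7,...\n",
);

print dedupe_grouped(@rows);
```

For the sample rows this keeps the same three data lines as the program above. If the ids are not contiguous, the same pair could be printed once per run, so the single %seen hash remains the safer choice for ungrouped input.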