Question

I have a tab-delimited file:

AA      11
AA      22
AA      11
AA      22
BBB     44
BBB     77
BBB     44
BBB     77

I want to print the distinct lines of the file:

AA      11
AA      22
BBB     44
BBB     77

I wrote this Perl script to do that:

#!/usr/bin/perl
$file1=$ARGV[0];
%record;
open(FP1,"$file1");
while($s1=<FP1>)
{
    chomp($s1);
    @array= split(/\t/,$s1);
    $name1=$array[0];
    $name2=$array[1];
    push @{$record{$name1}{trs}}, $name2;
    $ref=\%record;
}
for $name1 ( sort { $a <=> $b } keys %record )
{
    my $name2   = $$ref{$name1}{trs};
    print "$name1\t$name2\n";
}

but it doesn't work. Can someone help?

Solution

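You don't have to read the whole file into memory first. If you want to keep the lines in their original order, remember each line you have already seen in a hash and print a line only the first time it appears: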

use strict;
use warnings;
use autodie;

my $file1 = $ARGV[0];
my %seen;
open(my $FP1, "<", $file1);

while (my $s1 = <$FP1>) {
  next if $seen{$s1}++;
  print $s1;
}

OTHER TIPS

The main problem is that you store the AA records in an array called @{$record{'AA'}{trs}} (i.e. @{$record{'AA'}->{'trs'}}), but when you go to print those records, you don't iterate over that array, you just try to read it as a scalar.

The fact that your file is tab-delimited does not seem to be relevant, since you apparently consider two lines to be distinct if either field differs. That is the same as comparing whole lines, so you don't need to worry about the complexity of converting your lines to "records" for processing.

(Even aside from that, you have a lot of unnecessary code — for example, there's no reason at all to create $ref.)
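If you do want to keep the record-based structure, the points above could be sketched roughly as follows. This is only an illustration, not the one right way to do it: the sub name distinct_records, the %seen hash and the @order array are my own additions, and the input is hard-coded so the example runs on its own. It also drops the original sort { $a <=> $b }, which compares the keys numerically even though they are strings such as AA, and prints the names in first-seen order instead.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: takes tab-delimited lines and returns the distinct
# "name<TAB>value" pairs, grouped by name in first-seen order.
sub distinct_records {
    my @lines = @_;                       # work on a copy of the input
    my (%record, %seen, @order);
    for my $line (@lines) {
        chomp $line;
        my ($name1, $name2) = split /\t/, $line;
        next unless defined $name2;       # skip blank or malformed lines
        next if $seen{$name1}{$name2}++;  # this pair is already stored
        push @order, $name1 unless exists $record{$name1};
        push @{ $record{$name1}{trs} }, $name2;
    }
    my @out;
    for my $name1 (@order) {
        # Dereference the array and loop over its elements instead of
        # interpolating the array reference as if it were a scalar.
        push @out, "$name1\t$_" for @{ $record{$name1}{trs} };
    }
    return @out;
}

# Sample input matching the file in the question.
my @sample = ("AA\t11\n", "AA\t22\n", "AA\t11\n", "AA\t22\n",
              "BBB\t44\n", "BBB\t77\n", "BBB\t44\n", "BBB\t77\n");
print "$_\n" for distinct_records(@sample);
```

With the sample input this prints the four distinct lines from the question. For a real file you would pass the lines read from a filehandle instead of @sample.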

You can actually dispense with Perl altogether and use the standard sort utility. Note that this sorts the output rather than preserving the original line order:

sort -u <INPUT_FILE >OUTPUT_FILE

If the rows match exactly, then all you need is the uniq function from List::MoreUtils, which keeps the first occurrence of each line in its original order:

use strict;
use warnings;

use List::MoreUtils 'uniq';

my @data = <DATA>;
chomp @data;
print "$_\n" for uniq @data;


__DATA__
AA  11
AA  22
AA  11
AA  22
BBB 44
BBB 77
BBB 44
BBB 77

output

AA  11
AA  22
BBB 44
BBB 77

If you cannot install List::MoreUtils and have to do without uniq, you would have to write something like this:

use strict;
use warnings;

my @data = <DATA>;
chomp @data;

my (@unique, %unique);
for (@data) {
    push @unique, $_ unless $unique{$_}++;
}

print "$_\n" for @unique;

The output is identical to that of the previous example.

Licensed under: CC-BY-SA with attribution