Question

I am writing a Perl script that takes two files as input: one input is a tab-separated table with an identifier of interested in the second column, the second input is a list of identifiers that match the second column of the first file. THE GOAL is print only those lines of the table which contain an identifier in the second column and to print each line only once. I have written three versions of this program and have been finding different numbers of lines printed in each.

Version 1:

    # TAB-SEPARTED TABLE FILE
    open (FILE, $file); 
    while (<FILE>) {
            my $line = $_;
            chomp $line;
            # ARRAY CONTAINING EACH IDENTIFIER AS A SEPARATE ELEMENT
            foreach(@refs) {  
                    my $ref = $_;
                    chomp $ref;
                    if ( $line =~ $ref) { print "$line\n"; next; }
            }
    }

Version 2:

    # ARRAY CONTAINING EVERY LINE OF THE TAB-SEPARATED TABLE AS A SEPARATE LINE
    foreach(@doc) { 
            my $full = $_;
            # IF LOOP FOR PRINTING THE HEADER BUT NOT COMPARING IT TO ARRAY BELOW
            if ( $counter == 0 ) { 
                      print "$full\n"; 
                      $counter++; 
                      next; }
            # EXTRACT IDENTIFIER FROM LINE
            my @cells = split('\t', $full); 
            my $gene = $cells[1]; 
            foreach(@refs) {
                my $text = $_;
                if ( $gene =~ $text && $counter == 1 ) { # COMPARE IDENTIFIER 
                        print "$full\n";
                        next;
                }
        }
        $counter--;
    }

Version 3:

    # LIST OF IDENTIFIERS
    foreach(@refs) {
        my $ref = $_;
        # LIST OF EACH ROW OF THE TABLE
        foreach(@doc) {
                my $line = $_;
                my @cells = split('\t', $line);
                my $gene = $cells[1];
                if ( $gene =~ $ref ) { print "$line\n"; next; }
        }
    }

Each of these approaches gives me different output and I do not understand why. I also do not understand if I can trust any of them to give me the right output. The right output should not contain any duplicate lines but more than one row might match any identifier from the list.

Sample Input File:

    Position        Symbol  Name    REF     ALT
    chr1:887801     NOC2L   nucleolar complex associated 2 homolog (S. cerevisiae)  A       G
    chr1:888639     NOC2L   nucleolar complex associated 2 homolog (S. cerevisiae)  T       C
    chr1:888659     NOC2L   nucleolar complex associated 2 homolog (S. cerevisiae)  T       C
    chr1:897325     KLHL17  kelch-like 17 (Drosophila)      G       C
    chr1:909238     PLEKHN1 pleckstrin homology domain containing, family N member 1        G       C
    chr1:982994     AGRN    agrin   T       C
    chr1:1254841    CPSF3L  cleavage and polyadenylation specific factor 3-like     C       G
    chr1:3301721    PRDM16  PR domain containing 16 C       T
    chr1:3328358    PRDM16  PR domain containing 16 T       C

List is pulled from a file that looks like this:

    A1BG
    A2M
    A2ML1
    AAK1
    ABCA12
    ABCA13
    ABCA2
    ABCA4
    ABCC2

Its put into an array using this code:

    open (REF, $ref_file);
    while (<REF>) {
        my $line = $_;
        chomp $line;
        push(@refs, $line);
     }
     close REF;
Was it helpful?

Solution

Whenever you hear "I need to look up something", think hashes.

What you can do is create a hash that contains the elements you want to pull out of file #1. Then, use a second hash to track whether or not you printed it before:

#!/usr/bin/env perl

use warnings;
use strict;
use feature qw(say);

use autodie;   # This way, I don't have to check my open for failures

use constant {
    TABLE_FILE          => "file1.txt",
    LOOKUP_FILE         => "file2.txt",
};

open my $lookup_fh, "<", LOOKUP_FILE;

my %lookup_table;
while ( my $symbol = <$lookup_fh> ) {
    chomp $symbol,
    $lookup_table{$symbol} = 1;
}

close $lookup_fh;

open my $table_file, "<", TABLE_FILE;

my %is_printed;
while ( my $line = <$table_file> ) {
    chomp $line;
    my @line_array = split /\s+/, $line;
    my $symbol = $line_array[1];
    if ( exists $lookup_table{$symbol} and not exists $is_printed{$symbol} ) {
        say $line;
        $is_printed{$symbol} = 1;
    }
}

Two loops, but much more efficient. In yours, if you had 100 items in the first file, and 1000 items in the second file, you would have to loop 100 * 1000 times or 1,000,000. In this, you only loop the total number of lines in both files.

I use the three-parameter method of the open command which allows you to handle files with names that start with | or <, etc. Also, I use variables for my file handles which make it easier to pass the file handle to a subroutine if so desired.

I use use autodie; which handles issues such as what if my file doesn't open. In your program, the program would continue on its merry way. If you don't want to use autodie, you need to do this:

 open $fh, "<", $my_file or die qq(Couldn't open "$my_file" for reading);

I use two hashes. The first is %lookup_table which stores the Symbols you want to print. When I go through the first file, I can simply check if `$lookup_table{$symbol} exists. If it doesn't, I don't print it, if it does, I print it.

The second hash %is_printed keeps track of Symbols I've already printed. If $is_printed{$symbol} exists, I know I've already printed that line.

Even though you said the second table is tab separated, I use /\s+/ as the split regular expression. This will catch a tab, but it will also catch if someone used two tabs (to keep things looking nice) or accidentally typed a space before that tab.

OTHER TIPS

I'm pretty sure this should work:

$ awk '
    NR == FNR {Identifiers[$1]; next}
    $2 in Identifiers {
        $1 = ""; $0 = $0; if(!Printed[$0]++) {print}
    }' identifiers_file.txt data_file.txt

Given identifiers_file.txt such as this (to which I added NOC2L since there were no matching identifiers in your sample):

A1BG
A2M                               
A2ML1                       
AAK1                   
ABCA12
ABCA13
ABCA2
ABCA4
ABCC2
NOC2L

then your output will be:

$ awk '
    NR == FNR {Identifiers[$1]; next}
    $2 in Identifiers {
        $1 = ""; $0 = $0; if(!Printed[$0]++) {print}
    }' idents.txt data.txt
 NOC2L nucleolar complex associated 2 homolog (S. cerevisiae) A G
 NOC2L nucleolar complex associated 2 homolog (S. cerevisiae) T C

If that's correct and you want a Perl version, you can just:

$ echo 'NR == FNR {Identifiers[$1]; next} $2 in Identifiers { $1 = ""; $0 = $0; if(!Printed[$0]++) {print} }' \
    | a2p

I suggest you to mix first version and second and add hashes to them. First version because it's good(clear way) parse your data file line by line.

#!/usr/bin/perl

use strict;
use warnings;
use autodie;

open (REF, $ARGV[0]);
my %refs;
while (<REF>) {
    my $line = $_;
    chomp $line;
    $refs{$line} = 0;
}
close REF;

#for head printing
$refs{'Symbol'} = 0;

open (FILE, $ARGV[1]); 
while (<FILE>) {
        my $line = $_;
        my @cells = split('\t', $line);
        my $gene = $cells[1];
        #print $line, "\n" if exists $refs{$gene};
        if(exists $refs{$gene} and $refs{$gene} == 0)
        {
            $refs{$gene}++;
            print $line;
        }
}
close FILE;
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top