Perl Text-Parsing; Which algorithm is correct?

Question 1

Whenever you hear "I need to look up something", think hashes.

What you can do is create a hash that contains the elements you want to pull out of file #1. Then, use a second hash to track whether or not you printed it before:

#!/usr/bin/env perl

use warnings;
use strict;
use feature qw(say);

use autodie;   # This way, I don't have to check my open for failures

use constant {
    TABLE_FILE          => "file1.txt",
    LOOKUP_FILE         => "file2.txt",
};

open my $lookup_fh, "<", LOOKUP_FILE;

my %lookup_table;
while ( my $symbol = <$lookup_fh> ) {
    chomp $symbol,
    $lookup_table{$symbol} = 1;
}

close $lookup_fh;

open my $table_file, "<", TABLE_FILE;

my %is_printed;
while ( my $line = <$table_file> ) {
    chomp $line;
    my @line_array = split /\s+/, $line;
    my $symbol = $line_array[1];
    if ( exists $lookup_table{$symbol} and not exists $is_printed{$symbol} ) {
        say $line;
        $is_printed{$symbol} = 1;
    }
}

Two loops, but much more efficient. In yours, if you had 100 items in the first file, and 1000 items in the second file, you would have to loop 100 * 1000 times or 1,000,000. In this, you only loop the total number of lines in both files.

I use the three-parameter method of the open command which allows you to handle files with names that start with | or <, etc. Also, I use variables for my file handles which make it easier to pass the file handle to a subroutine if so desired.

I use use autodie; which handles issues such as what if my file doesn't open. In your program, the program would continue on its merry way. If you don't want to use autodie, you need to do this:

 open $fh, "<", $my_file or die qq(Couldn't open "$my_file" for reading);

I use two hashes. The first is %lookup_table which stores the Symbols you want to print. When I go through the first file, I can simply check if `$lookup_table{$symbol} exists. If it doesn't, I don't print it, if it does, I print it.

The second hash %is_printed keeps track of Symbols I've already printed. If $is_printed{$symbol} exists, I know I've already printed that line.

Even though you said the second table is tab separated, I use /\s+/ as the split regular expression. This will catch a tab, but it will also catch if someone used two tabs (to keep things looking nice) or accidentally typed a space before that tab.

Question 2

I'm pretty sure this should work:

$ awk '
    NR == FNR {Identifiers[$1]; next}
    $2 in Identifiers {
        $1 = ""; $0 = $0; if(!Printed[$0]++) {print}
    }' identifiers_file.txt data_file.txt

Given identifiers_file.txt such as this (to which I added NOC2L since there were no matching identifiers in your sample):

A1BG
A2M                               
A2ML1                       
AAK1                   
ABCA12
ABCA13
ABCA2
ABCA4
ABCC2
NOC2L

then your output will be:

$ awk '
    NR == FNR {Identifiers[$1]; next}
    $2 in Identifiers {
        $1 = ""; $0 = $0; if(!Printed[$0]++) {print}
    }' idents.txt data.txt
 NOC2L nucleolar complex associated 2 homolog (S. cerevisiae) A G
 NOC2L nucleolar complex associated 2 homolog (S. cerevisiae) T C

If that's correct and you want a Perl version, you can just:

$ echo 'NR == FNR {Identifiers[$1]; next} $2 in Identifiers { $1 = ""; $0 = $0; if(!Printed[$0]++) {print} }' \
    | a2p

Question 3

I suggest you to mix first version and second and add hashes to them. First version because it's good(clear way) parse your data file line by line.

#!/usr/bin/perl

use strict;
use warnings;
use autodie;

open (REF, $ARGV[0]);
my %refs;
while (<REF>) {
    my $line = $_;
    chomp $line;
    $refs{$line} = 0;
}
close REF;

#for head printing
$refs{'Symbol'} = 0;

open (FILE, $ARGV[1]); 
while (<FILE>) {
        my $line = $_;
        my @cells = split('\t', $line);
        my $gene = $cells[1];
        #print $line, "\n" if exists $refs{$gene};
        if(exists $refs{$gene} and $refs{$gene} == 0)
        {
            $refs{$gene}++;
            print $line;
        }
}
close FILE;