Debugging a Perl program: Hashes, Writing to and Accessing Filehandles

https://stackoverflow.com/questions/11480615

20-06-2021
|

Question

I'm working on a bioinformatics project that involves piping together different scripts and input parameters for the analysis of Next-Gen Sequencing Illumina data. I need help debugging the first script. It's task is to parse a qseq file, convert the 'good' samples to fastq format, and save the output to temporary txt files (to disk).

For purposes of debugging, the piping scheme goes like this:

# the input parameters are "fed" into the script and the output is written
# to tmp.txt
script01.pl [input parameters] > tmp.txt

I type this command into the terminal then look in the tmp.txt file to check if the script outputs the expected results. But for the overall project I have something called a wrapper script that pipes all the scripts together. Script01 will save the output to tmp files for end1 and end2 data since they need to be processed by scripts 02-06 separately.

Here's the the code. I added descriptive comments so you can understand what's going on. Also, I don't have the qseq file to show you but just assume that I'm parsing the fields correctly:

#!/usr/bin/perl
use strict; use warnings;

my $end1_temp=shift; # this variable is the location of the temporary file
my $end2_temp=shift; # this variable is the location of the temporary file
my $qseq_file=shift;
my $barcode=shift;

# declare and initialize variables
my $overhang= 'C[AT]GC';
my $end1_trim_offset= length($barcode)+4;
my $end2_trim_offset= 4; 

my %to_keep=(); # an empty hash
my @line=();    # an empty array

# open the qseq file, or exit
open (QSEQ_1_FILE, $qseq_file) or die "couldn't open $qseq_file for read\n";

# also, open (for output) the end1 temp file, so we can write to it while
# we process the end1 input file above
(open END_1_FILEOUT,">$end1_temp") or die "couldn't open $end1_temp for write\n";

# reads each line of the qseq data file one at a time. 
# assume each sample is kept on a separate line
while(<QSEQ_1_FILE>){
chomp; @line=split; # index and formats the qseq fields for parsing

# skip samples those that didn't pass QC
$line[10]>0 or next;

# process the samples here. look at only the samples that pass the quality 
# control and whose sequences contain the barcode+overhang at the beginning
# of the string, otherwise skip to the next sample in the data file 
# (i.e start the next iteration of the loop)

# trim the barcode+overhang from seq and qual
$line[8]= substr($line[8], $end1_trim_offset); #sequence
$line[9]= substr($line[9], $end1_trim_offset); #quality

# unique sequence identifier. don't worry about what $line[1-6] represent
# just know that $identifier is unique for each of the 'good' samples
my $identifier = $line[0].'-'.$line[1].'-'.$line[2].'_'.$line[3].'-'.$line[4].'-'.$line[5].'_#'.$line[6];

# store the identifiers of the 'good' samples in a hash.
# the hash should contain the identifier as the key and numbers (1,2,3,etc.)
# as the values. the following increments the hash values for each identifier.
$to_keep{$identifier}++;

# the following below is suppose to write information to the filehandle in fastq format:
# @[the identifier]/1
# sequence [ATGCAGTAAT...]
# +[the identifier]/1
# quality [ASCII characters]

print END_1_FILEOUT '@' . "$identifier/1\n$line[8]\n" . '+' . identifier/1\n$line[9]\n";

}
print "Found " . int(keys %to_keep) . " reads from end1\n";
# close the filehandles
close QSEQ_FILE; close END_1_FILEOUT;

At this point, I have a hash containing the identifiers of only the 'good' sequences and have written the fastq data to a stored location. Script01 is suppose to save the fastq output to temp files on disk.

$end1_temp = '~/tmp/sampleD1_end1.fq' and end2_temp = '~/tmp/sampleD1_end2.fq'

Question: Is the print END_1_FILE line above writing the fastq data to the filehandle or is it writing to the $end1_temp variable? I ask because I'm going to need to pass the $end1_temp and the $end2_temp variables to script02. Also, for purposes of debugging how do I view the fastq output from script01?

Here's the rest of the code that I need help with. It's on the same script and directly follows the code above:

# if the sequence is paired (has both end1 and end2 data), then the qseq_2 file exists and the conditions evaluates to true
if ($end2_temp) {
# changes the qseq filename from file name from end1 to end2 data
# don't worry about why it works
$qseq_file=~ s/'_1_'/'_2_'/;

open (QSEQ_2_FILE, $qseq_file) or die "could not open $qseq_file for read\n";

# open (for output) the end2 temp file, so we can write to it while we process
# the end2 input file above
open (END_2_FILEOUT, ">$end2_temp") or die "could not open $end2_temp for write\n";

# reads each line of the end2 file one at a time
while(<QSEQ_2_FILE>){
# skip comments
/^\#/ and next;

chomp; #keep the qseq fields on one line.
my @line= split; #indexes the qseq fields.

# unique sequence identifier that preserves the sequencing information
# in other words, samples from end1 and end2 will have the same unique
# identifier because they contain the exact same fields in columns 0-6 
# or $line[0-6]. also end1 and end2 have the same number of samples
my $identifier = $line[0] .'-'.$line[1].'-'.$line[2].'_'.$line[3].'-'.$line[4].'-'.$line[5].'_#'.$line[6];

# recall the %to_keep hash which stored the identifiers of the 'good' samples; 
# it was inside the QSEQ_1_FILE loop
# here I only want the end2 samples whose identifiers match the identifiers
# from the end1 samples
# skip sample where end1 didn't pass; does this work??
# the condition is suppose to evaluate to false if the identifiers don't match
$to_keep{$identifier} or next;

# trim the barcode+overhang from seq and qual
$line[8]=substr($line[8], $end2_trim_offset);  # sequence
$line[9]=substr($line[9], $end2_trim_offset);  # quality

print END_2_FILEOUT '@' . "$identifier/2\n$line[8]\n" . '+' . "$identifier/2\n$line[9]\n";

}

close QSEQ_2_FILE; close END_2_FILEOUT;

}

This is the end of script01. At this point, I should have written the fastq data of 'good' samples for end1 and end2 to separate stored locations. Script01 is suppose to save the fastq output to two temp files on disk. I guess my question is for purposes of debugging, how do I view the tmp files created by script01?

Lastly, when I enter the command script01.pl [input parameters] > tmp.txt into the Linux terminal, it saves the output of script01 to tmp.txt. "Found X reads from end1" is what the script prints after processing the end1 reads, where X is the number of reads in the %to_keep hash.

When I look in tmp.txt, it displays Found 0 reads from end1. Since it prints 0, it means there is nothing stored in the hash. It's suppose to store around 6.3 million reads from end1. Can someone help me figure out why none of the reads are getting stored in the hash?

I think the problem is none of the reads are passing the the criteria I'm using for deciding whether or not they should be stored in hash. Another problem might be with how I'm storing the identifiers.

Can you guys look over it and see if there's anything I might have missed?

thanks. any suggestions or answers to my questions are greatly appreciated.

Solution

Question: Is the print END_1_FILE line above writing the fastq data to the filehandle or is it writing to the $end1_temp variable?

I assume you mean END_1_FILEOUT. This is an writeable filehandle to a file whose named is stored in $end1_temp. The data will be sent to the file.

Also, for purposes of debugging how do I view the fastq output from script01?

You look in the file. From your question I think the file is called ~/tmp/sampleD1_end1.fq. You should look in there.

Script01 is suppose to save the fastq output to two temp files on disk. I guess my question is for purposes of debugging, how do I view the tmp files created by script01?

As before, the output should be in the files whose names are in $end1_temp and $end2_temp, presumably ~/tmp/sampleD1_end1.fq and ~/tmp/sampleD1_end2.fq. Open them with your editor and take a look.

Lastly, when I enter the command script01.pl [input parameters] > tmp.txt into the Linux terminal, it saves the output of script01 to tmp.txt. When I look in tmp.txt it displays "Found 0 reads from end1". Since it prints 0, it means there is nothing stored in the hash. It's suppose to store around 6.3 million reads from end1. Can someone help me figure out why none of the reads are getting stored in the hash?

This is basic debugging. From your code it is clear that either the QSeq file you are reading is empty or your test $line[10] > 0 or next is throwing out every record. Both are simple to check by adding a diagnostic print statement directly after the split to show the value of $line[10]. My bet is that you are looking at the wrong field.

In addition to this you really should indent your code. That would help us to understand it far better than the extensive comments. Also,

$line[10] > 0 or next

is better written as

next unless $line[10] > 0

and you should use the three-parameter form of open and lexical filehandles.

Here is a tidied version of your code with these improvements as well as a few more

use strict;
use warnings;

die qq(Insufficient arguments "@ARGV") unless @ARGV >= 4;

my ($end1_temp, $end2_temp, $qseq1_file, $barcode) = @ARGV;

my $overhang         = 'C[AT]GC';
my $end1_trim_offset = length($barcode) + 4;
my $end2_trim_offset = 4;

my $idformat = '%s-%s-%s_%s-%s-%s_#%s';

my %to_keep;

open my $qsec1, '<', $qseq1_file
    or die "couldn't open $qseq1_file for read\n";

open my $end1, '>', $end1_temp
    or die "couldn't open $end1_temp for write\n";

while (<$qsec1>) {

  my @line = split;
  next unless $line[10] > 0;

  $_ = substr($_, $end1_trim_offset) for @line[8,9];

  my $identifier = sprintf $idformat, @line;
  $to_keep{$identifier}++;

  print $end1
      '@' . "$identifier/1\n" .
      "$line[8]\n" .
      '+' . "$identifier/1\n" .
      "$line[9]\n";
}

close $qsec1;
close $end1;

printf "Found %d reads from end1\n", int keys %to_keep;


exit unless $end2_temp;

$qseq1_file =~ s/'_1_'/'_2_'/;

open my $qseq2, '<', $qseq1_file
    or die "could not open $qseq1_file for read\n";

open my $end2, '>', $end2_temp
    or die "could not open $end2_temp for write\n";

while (<$qseq2>) {

  next if /^#/;
  my @line = split;

  my $identifier = sprintf $idformat, @line;
  next unless $to_keep{$identifier};

  $_ = substr($_, $end2_trim_offset) for @line[8,9];

  print $end2
      '@' . "$identifier/2\n" .
      "$line[8]\n" .
      '+' . "$identifier/2\n" .
      "$line[9]\n";
}

close $qseq2;
close $end2;

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow