Pergunta

Another question for everyone. To reiterate I am very new to the Perl process and I apologize in advance for making silly mistakes

I am trying to calculate the GC content of different lengths of DNA sequence. The file is in this format:

>gene 1
DNA sequence of specific gene
>gene 2
DNA sequence of specific gene
...etc...

This is a small piece of the file

  >env
  ATGCTTCTCATCTCAAACCCGCGCCACCTGGGGCACCCGATGAGTCCTGGGAA 

I have established the counter and to read each line of DNA sequence but at the moment it is do a running summation of the total across all lines. I want it to read each sequence, print the content after the sequence read then move onto the next one. Having individual base counts for each line.

This is what I have so far.

#!/usr/bin/perl


#necessary code to open and read a new file and create a new one.
use strict;
my $infile = "Lab1_seq.fasta";
open INFILE, $infile or die "$infile: $!";
my $outfile = "Lab1_seq_output.txt";            
open OUTFILE, ">$outfile" or die "Cannot open $outfile: $!";

#establishing the intial counts for each base
my $G = 0;
my $C = 0;
my $A = 0;
my $T = 0;

 #initial loop created to read through each line 
 while ( my $line = <INFILE> ) {
chomp $line;


# reads file until the ">" character is encounterd and prints the line
if ($line =~ /^>/){
    print OUTFILE "Gene: $line\n";

}
# otherwise count the content of the next line. 
# my percent counts seem to be incorrect due to my Total length counts skewing the following line. I am currently unsure how to fix that
elsif ($line =~ /^[A-Z]/){
    my @array = split //, $line;
    my $array= (@array);
 # reset the counts of each variable
    $G = (); 
    $C = ();
    $A = ();
    $T = ();
    foreach $array (@array){ 
 #if statements asses which base is present and makes a running total of the bases.     
    if ($array eq 'G'){
        ++$G;   
    }        
    elsif ( $array eq 'C' ) {
    ++$C; }
    elsif ( $array eq 'A' ) {
    ++$A; }
    elsif ( $array eq 'T' ) {
    ++$T; }

  } 
# all is printed to the outfile
    print OUTFILE "G:$G\n";
    print OUTFILE "C:$C\n";
    print OUTFILE "A:$A\n";
    print OUTFILE "T:$T\n";
    print OUTFILE "Total length:_", ($A+=$C+=$G+=$T), "_base pairs\n";
    print OUTFILE "GC content is(percent):_", (($G+=$C)/($A+=$C+=$G+=$T)*100),"_%\n";


}

}
#close the outfile and the infile
close OUTFILE; 
close INFILE;

Again I feel like I am on the right path, I am just missing some basic foundations. Any help would be greatly appreciated.

The final problem is in the final counts printed out. My percent values are wrong and give me the wrong value. I feel like the total is being calculated then that new value is incorporated into the total.

Foi útil?

Solução 2

Just reset the counters when a new gene is found. Also, I'd use hashes for the counting:

use strict; use warnings;

my %counts;
while (<>) {
  if (/^>/) {
    # print counts for the prev gene if there are counts:
    print_counts(\%counts) if keys %counts;
    %counts = (); # reset the counts
    print $_;     # print the Fasta header
  } else {
    chomp;
    $counts{$_}++ for split //;
  }
}
print_counts(\%counts) if keys %counts;  # print counts for last gene

sub print_counts {
  my ($counts) = @_;
  print "$_:=", ($counts->{$_} || 0), "\n" for qw/A C G T/;
}

Usage: $ perl count-bases.pl input.fasta.

Example output:

> gene 1
A:=3
C:=1
G:=5
T:=5
> gene 2
A:=1
C:=5
G:=0
T:=13

Style comments:

When opening a file, always use lexical filehandles (normal variables). Also, you should do a three-arg open. I'd also recommend the autodie pragma for automatic error handling (since perl v5.10.1).

use autodie;
open my $in,  "<", $infile;
open my $out, ">", $outfile;

Note that I don't open files in my above script because I use the special ARGV filehandle for input, and print to STDOUT. The output can be redirected on the shell, like

$ perl count-bases.pl input.fasta >counts.txt

Declaring scalar variables with their values in parens like my $G = (0) is weird, but works fine. I think this is more confusing than helpful. → my $G = 0.

Your intendation is a bit weird. It is very unusual and visually confusing to put closing braces on the same line with another statement like

...
elsif ( $array eq 'C' ) {
    ++$C; }

I prefer cuddling elsif:

   ...
} elsif ($base eq 'C') {
    $C++;
}

This statement my $array= (@array); puts the length of the array into $array. What for? Tip: You can declare variables right inside foreach-loops, like for my $base (@array) { ... }.

Outras dicas

Several things: 1. use hash instead of declaring each element. 2. assignment such as $G = (0); is indeed working, but it is not the right way to assign scalar. What you did is declaring an array, which in scalar context $G = is returning the first array item. The correct way is $G = 0.

my %seen;
$seen{/^([A-Z])/}++ for (grep {/^\>/} <INFILE>);

foreach $gene (keys %seen) {
    print "$gene: $seen{$gene}\n";
}
Licenciado em: CC-BY-SA com atribuição
Não afiliado a StackOverflow
scroll top