Open a file.txt and find the possible start and end positions of its genes

https://stackoverflow.com/questions/22272427

11-06-2023

Question

Hi, I have a file that I would like to open in order to find the start and end positions of its genes. I also have some extra information. The beginning of each gene is marked by the following pattern: an 8-letter consensus known as the Shine-Dalgarno sequence (TAAGGAGG), followed by 4-10 bases downstream before the initiation codon (ATG). However, there are variants of the Shine-Dalgarno sequence, the most common being [TA][AC]AGGA[GA][GA]. The end of the gene is specified by one of the stop codons TAA, TAG, or TGA. Care must be taken that the stop codon is found in the correct open reading frame (ORF). I have made a text file containing the genome and I open it with the code below. The errors begin when I try to read the genome and record the start and end positions. Any help? Thanks a lot.

#!/usr/bin/perl -w
    use strict;
    use warnings;
    # Searching for motifs
    # Ask the user for the filename of the file containing
    my $proteinfilename = "yersinia_genome.fasta";
    print "\nYou open the filename of the protein sequence data: yersinia_genome.fasta \n";
    # Remove the newline from the protein filename
    chomp $proteinfilename;
    # open the file, or exit
    unless (open(PROTEINFILE, $proteinfilename) ) 
    {
      print "Cannot open file \"$proteinfilename\"\n\n";
      exit;
    }
    # Read the protein sequence data from the file, and store it
    # into the array variable @protein
    my @protein = <PROTEINFILE>;
    # Close the file - we've read all the data into @protein now.
    close PROTEINFILE;
    # Put the protein sequence data into a single string, as it's easier
    # to search for a motif in a string than in an array of
    # lines (what if the motif occurs over a line break?)
    my $protein = join( '', @protein);
    # Remove whitespace.
    $protein =~ s/\s//g;
    # In a loop, ask the user for a motif, search for the motif,
    # and report if it was found.
    my $motif='TAAGGAGG';
    do 
    {
      print "\n Your motif is:$motif\n";
      # Remove the newline at the end of $motif
      chomp $motif;
      # Look for the motif
        if ( $protein =~ /$motif/ ) 
        {
          print "I found it!This is the motif: $motif in line $.. \n\n";
        } 
        else 
        {
          print "I couldn't find it.\n\n";
        }
    }
    until ($motif =~ /TAAGGAGG/g); 
    my $reverse=reverse $motif;
    print "Here is the reverse Motif: $reverse. \n\n";
    #HERE STARTS THE PROBLEMS,I DONT KNOW WHERE I MAKE THE MISTAKES
    #$genome=$motif;
    #$genome = $_[0];
    my $ORF = 0;
    while (my $genome = $proteinfilename) {
        chomp $genome;
        print "processing $genome\n";
        my $mrna = split(/\s+/, $genome);
        while ($mrna =~ /ATG/g) {
          # $start and $stop are 0-based indexes
          my $start = pos($mrna) - 3; # back up to include the start sequence
          # discard remnant if no stop sequence can be found
          last unless $mrna=~ /TAA|TAG|TGA/g;
    #m/^ATG(?:[ATGC]{3}){8,}?(?:TAA|TAG|TGA)/gm;
      my $stop    = pos($mrna);
      my $genlength = $stop - $start;
      my $genome    = substr($mrna, $start, $genlength);
      print "\t" . join(' ', $start+1, $stop, $genome, $genlength) . "\n";
      #      $ORF ++;
            #print "$ORF\n";
       }
    }
    exit;

Solution

Thanks, I got it working. The solution is:

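local $_ = $protein;
while (/ATG/g) {
    my $start = pos() - 3;    # back up to include the start codon
    if (/TAA|TAG|TGA/g) {     # next stop codon after the ATG (frame not checked)
        my $stop = pos;
        print $start, " ", $stop, " ", $stop - $start, " ",
            substr($_, $start, $stop - $start), $/;
    }
    else {
        last;                 # a failed /g match resets pos; stop instead of rescanning
    }
}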
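
Note that the loop above takes the first stop codon after each ATG regardless of reading frame, and it does not use the Shine-Dalgarno pattern from the question. A minimal sketch of how both conditions could be applied is below. The helper name find_genes, its return layout, and the test sequence are made up for illustration, and it assumes the sequence is uppercase with whitespace already removed.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): returns one array ref
# [start, stop, length, sequence] per candidate gene. Offsets are 0-based;
# stop is the offset just past the stop codon.
sub find_genes {
    my ($dna) = @_;
    my @genes;
    # Shine-Dalgarno variant, then the shortest 4-10 base spacer that is
    # immediately followed by ATG (the lookahead leaves pos at the ATG).
    while ($dna =~ /[TA][AC]AGGA[GA][GA][ACGT]{4,10}?(?=ATG)/g) {
        my $start = pos($dna);
        # Walk codon by codon so that only in-frame stop codons count.
        for (my $i = $start + 3; $i + 3 <= length($dna); $i += 3) {
            my $codon = substr($dna, $i, 3);
            next unless $codon =~ /^(?:TAA|TAG|TGA)$/;
            my $stop = $i + 3;
            push @genes, [$start, $stop, $stop - $start,
                          substr($dna, $start, $stop - $start)];
            last;
        }
    }
    return @genes;
}

# Made-up test sequence: SD site, 5-base spacer, then ATG AAT GAC TAA.
# The TGA spanning "AATGAC" is out of frame and is skipped.
for my $gene (find_genes("CCTAAGGAGGACGTAATGAATGACTAA")) {
    print join(' ', @$gene), "\n";
}
```

This only reports the first in-frame stop after each start site and scans one strand. Whether that is enough depends on the genome, so treat it as a starting point.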

OTHER TIPS

while (my $genome = $proteinfilename) {

This creates an endless loop: the condition copies the file name (not the $protein data) into $genome on every pass, and because a non-empty string is always true, the loop never terminates. The purpose of the while loop is unclear.

Perhaps you simply mean

my ($genome) = $protein;
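
To see why the question's loop never ends, and why the split line inside it does not yield the sequence, here is a small demonstration. The iteration cap and the short sample string are only there so the demo terminates; they are not part of the original code.

```perl
use strict;
use warnings;

my $proteinfilename = "yersinia_genome.fasta";

# The loop condition is an assignment. Its value is the assigned string,
# and any non-empty string other than "0" is true in Perl.
my $iterations = 0;
while (my $genome = $proteinfilename) {
    $iterations++;
    last if $iterations >= 3;    # safety cap, only so this demo stops
}
print "Looped $iterations times over the same file name\n";

# split in scalar context returns the number of fields, not the text.
my $mrna = split(/\s+/, "ATGAAATAA");
print "Scalar split returned: $mrna\n";
```

With a plain assignment such as my $mrna = $genome; the variable holds the sequence itself.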

Here is a simplistic attempt at fixing the obvious problems in your code.

#!/usr/bin/perl -w
use strict;
use warnings;
my $proteinfilename = "yersinia_genome.fasta";
chomp $proteinfilename;
# die, don't print & exit; $! says why the open failed
open(my $fh, '<', $proteinfilename)
  or die "Cannot open file \"$proteinfilename\": $!\n";
# Avoid creating a potentially large temporary array
# Read directly into $protein instead
my $protein = join ('', <$fh>);
close $fh;
$protein =~ s/\s//g;
# As this is a static variable, no point in looping
my $motif='TAAGGAGG';
chomp $motif;
if ( $protein =~ /$motif/ ) 
{
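  # $. is a line number and says nothing about where in the joined string the motif is;
  # $-[0] is the 0-based offset where the match starts
  print "I found it! The motif $motif starts at offset ", $-[0], ".\n\n";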
}
else 
{
  print "I couldn't find it.\n\n";
}
my $reverse=reverse $motif;
print "Here is the reverse Motif: $reverse. \n\n";
# $ORF isn't used; removed
# Again, no point in writing a loop
# Also, $genome is a copy of the data, not the filename
my $genome = $protein;
# Whitespace was already removed, so there is nothing to chomp or split;
# split in scalar context would return a field count, not the sequence
my $mrna = $genome;
while ($mrna =~ /ATG/g) {
  my $start = pos($mrna) - 3; # back up to include the start sequence
  last unless $mrna=~ /TAA|TAG|TGA/g;
  my $stop    = pos($mrna);
  my $genlength = $stop - $start;
  my $genome    = substr($mrna, $start, $genlength);
  print "\t" . join(' ', $start+1, $stop, $genome, $genlength) . "\n";
}
exit;
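
One more thing to watch, assuming the file really is in FASTA format as its name suggests: header lines start with ">" and would be joined into the sequence by the code above. A rough sketch of filtering them out follows. The helper name fasta_lines_to_sequence is made up, the upper-casing is an extra assumption in case the file uses lowercase bases, and a file with several records would be concatenated into one string.

```perl
use strict;
use warnings;

# Hypothetical helper: takes the lines of a single-record FASTA file and
# returns the sequence as one uppercase string, without the header line
# and without any whitespace.
sub fasta_lines_to_sequence {
    my @lines = @_;
    my $seq = join '', grep { !/^>/ } @lines;    # drop ">" header lines
    $seq =~ s/\s+//g;                            # drop newlines and spaces
    return uc $seq;
}

# Made-up input standing in for the lines read from the file.
my @demo_lines = (">demo record\n", "atgaaa\n", "TAA\n");
print fasta_lines_to_sequence(@demo_lines), "\n";
```

In the script above this would replace the join and the s/\s//g step, with the lines read from the file passed in.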

Licensed under: CC-BY-SA with attribution
