Pregunta

I want to count the number of occurrences of nucleotides (the letters 'A, T, G, C' in a string). I was trying to use the tr/// operator for this, but it returns a count of zero every time in the below code.

This only happens if I use a variable inside the tr/// operator. If I type the individual letters separately it works. I wanted to know if we could use variables inside tr/// operator for pattern matching (and counting). And if we can, someone tell me how to modify my code.

Later I plan to count the number of codons (~64). Hence the trouble. Appreciate your time. Thanks!

#!/usr/bin/perl

use strict;
use warnings;

my $orf = "ATGCTAGCTAGCATAGAGCTAGCTA";
my @atgc = qw(A T G C);
my %hash = ();

foreach my $nt(@atgc) {
  $hash{$nt} = ($orf =~ tr/$nt//);
}
¿Fue útil?

Solución

$hash{$nt} = eval "\$orf =~ tr/\Q$nt\E//"

should do the job. Maybe this is not the most effective solution though.

Otros consejos

There is no instance of "$", "n" or "t" in "ATGCTAGCTAGCATAGAGCTAGCTA", thus tr correctly returns zero.

If you want to build a tr/// operator, you'll need to parse and compile it.

my %counts;
for my $nt (@atgc) {
   $counts{$nt} = eval "\$orf =~ tr/\Q$nt\E//";
}

But I wouldn't use tr///.

my %atgc = map { $_ => 1 } @atgc;
my %counts;
++$counts{$nt} for grep $atgc{$_}, split //, $orf;

Unfortunately Perl won't interpolate a variable into the search list for tr///. You will have to use a regular expression instead:

use strict;
use warnings;

my $orf = "ATGCTAGCTAGCATAGAGCTAGCTA";
my @atgc = qw(A T G C);
my %count;

$count{$1}++ while $orf =~ /([@atgc])/g;

printf "%s => %d\n", $_, $count{$_} for @atgc;

output

A => 8
T => 6
G => 6
C => 5

Beware

This isn't - as it may appear - a general solution for matching any one of an array of strings. What is happening is @atcg is being interpolated into the regex as if it was in a double-quoted string. That means Perl will use the $" built-in variable (set to a single space by default) to turn it into a string equivalent to join $", @atgc.

So the code actually looks like

$count{$1}++ while $orf =~ /([A T G C])/g;

which will match spaces as well as the letters, and may break altogether if @atgc contains anything that is special inside a regex character class like ^, ] or -.

Counting spaces unnecessarily shouldn't be a problem, but if your list may contain symbols then this isn't a solution you should be using.

A count for every ASCII character can be written safely as

$count{$1}++ while $orf =~ /(.)/sg;

and the unwanted information in the %count hash can simply be ignored.

You could use s///g:

#!/usr/bin/env perl

use strict; use warnings;

my $orf = "ATGCTAGCTAGCATAGAGCTAGCTA";
my @atgc = qw(A T G C);

my %hash = map {$_ => $orf =~ s/$_/$_/g } @atgc;

Output:

---
A: 8
C: 5
G: 6
T: 6

If the source string is going to be really long, you can treat it as holding the contents of a while and read one byte at a time assuming the characters are guaranteed to be ASCII.

#!/usr/bin/env perl

use strict; use warnings;

my $orf = "ATGCTAGCTAGCATAGAGCTAGCTA";
my %atgc = map { $_ => undef } qw(A T G C);

use YAML;
print Dump count(\$orf, \%atgc);

sub count {
    my $src_ref = shift;
    my $lookup = shift;

    my %count;

    open my $src, '<', $src_ref or die $!;
    {
        local $/ = \1;
        exists $lookup->{$_} and $count{ $_ } += 1 while <$src>;
    }
    close $src;

    return \%count;
}

You don't need regular expressions etc. to do this, you just need to walk a string:

my $orf = "ATGCTAGCTAGCATAGAGCTAGCTA";
my %nt;

$nt{$_}++ foreach (split('', $orf));

Make a pattern compiler.

sub make_counter {
  my @sequences = @_;

  my $pattern = "(?:" . join("|", map quotemeta, @sequences) . ")";

  my $compiled = eval q<
    sub {
      local($_) = @_;
      my $n = () = /$pattern/g;
    }
  >;

  if (defined $compiled) {
    return $compiled;
  }
  else {
    die "$0: internal: counter compilation failed:\n$@\n";
  }
}

With quotemeta, we force all characters in the sequence to match themselves only with no special meanings. Section 4 of the Perl FAQ describes that funky bit for counting matches:

Another version uses a global match in list context, then assigns the result to a scalar, producing a count of the number of matches.

$count = () = $string =~ /-\d+/g;

Note that it tolerates junk in your sequence. For example, count nucleotides with

my @nucleotides = qw/ G A T C /;

my $numnuc = make_counter @nucleotides;
print $numnuc->("xGxAxTxxxxTyA1C2A"), "\n";

Output:

7

Count codons with

my @codons = qw(
  TTT TCT TAT TGT TTC TCC TAC TGC TTA TCA TAA TGA
  TTG TCG TAG TGG CTT CCT CAT CGT CTC CCC CAC CGC
  CTA CCA CAA CGA CTG CCG CAG CGG ATT ACT AAT AGT
  ATC ACC AAC AGC ATA ACA AAA AGA ATG ACG AAG AGG
  GTT GCT GAT GGT GTC GCC GAC GGC GTA GCA GAA GGA
  GTG GCG GAG GGG
);

my $numcod = make_counter @codons;
print $numcod->("GAG-GGG!AGG,TAT#TTT");

Note that any junk, if present, must occur between codon sequences.

Output:

5
Licenciado bajo: CC-BY-SA con atribución
No afiliado a StackOverflow
scroll top