Question

I have to count the number of overlapping dimers (AA, AG, AC, AT, GA, GG, GC, GT, CC, CG, CA, CT, TT, TA, TG, TC) in multiple sequences using Perl. I wrote the following code but it only works for one sequence. How can I extend it to multiple sequences?

#!/usr/bin/perl -w
open FH, "sample.txt";
$genome=<FH>;
%count=();
$count{substr($genome, $_, 2)}++ for 0..length($genome)-2;
print "AA: $count{AA}\n";
print "AG: $count{AG}\n";
print "AC: $count{AC}\n";
print "AT: $count{AT}\n";

print "TA: $count{TA}\n";
print "TG: $count{TG}\n";
print "TC: $count{TC}\n";
print "TT: $count{TT}\n";

print "GA: $count{GA}\n";
print "GG: $count{GG}\n";
print "GC: $count{GC}\n";
print "GT: $count{GT}\n";

print "CA: $count{CA}\n";
print "CG: $count{CG}\n";
print "CC: $count{CC}\n";
print "CT: $count{CT}\n";

I need:

  1. counts for each sequence and
  2. the total counts

input example: sample.txt

ATGGGCTCCTCCGCCATCACCGTGAGCTTCCTCCTCTTTCTGGCATTTCAGCTCCCAGGGCAAACAGGAGCAAATCCCGTGTATGGCTCTGTGTCCAATGCAGACCTGATGGATTTCAAGTAAAAG
ATGGTGAGAAAATGGGCCCTGCTCCTGCCCATGCTGCTCTGCGGCCTGACTGGTCCCGCACACCTCTTCCAGCCAAGCCTGGTGCTGGAGATGGCCCAGGTCCTCTTGGATAACTACTGCTTCCCAGAGAACCTGATGGGGATGCAGGGAGCCATCGAGCAGGCCATCAAAAGTCAGGAGATTCTGTCTATCTCAGACCCTCAGACTCTGGCCCATGTGTTGACAGCTGGGGTGCAGAGCTCCTTGAATGACCCTCGCCTGGTCATCTCCTATGAGCCCAGCACCCTCGAGGCCCCTCCGCGAGCTCCAGCAGTCACGAACCTCACACTAGAGGAAATCATCGCAGGGCTGCAGGATGGCCTTCGCCATGAGATTCTGGAAGGCAATGTGGGCTACCTGCGGGTGGACGACATCCCGGGCCAGGAGGTGATGAGCAAGCTGAGGAGCTTCCTGGTGGCCAACGTCTGGAGGAAGCTCGTGAACACGTCCGCCTTGGTGCTGGACCTCCGGCACTGCACTGGGGGACACGTGTCTGGCATCCCCTATGTCATCTCCTACCTGCACCCAGGGAGCACAGTCTCGCACGTGGACACCGTCTACGACCGCCCCTCCAACACAACCACTGAGATCTGGACCCTGCCTGAAGCCCTGGGAGAGAAGTACAGTGCAGACAAGGATGTGGTGGTCCTCACCAGCAGCCGCACGGGGGGCGTGGCTGAGGACATCGCTTACATCCTCAAACAGATGCGCAGGGCCATCGTGGTGGGCGAGCGGACTGTTGGGGGGGCTCTGAACCTCCAGAAGCTGAGGGTAGGCCAGTCCGACTTCTTTCTCACTGTGCCTGTGTCCAGATCCCTGGGGCCCCTGGGTGAGGGCAGCCAGACGTGGGAGGGCAGTGGGGTGCTGCCCTGTGTGGGGACACCGGCCGAGCAGGCCCTGGAGAAAGCCCTGGCCGTTCTCATGCTGCGCAGGGCCCTGCCAGGAGTCATTCAGCGCCTTCAGGAGGCGCTGCGCGAGTACTACACGCTGGTGGACCGTGTGCCCGCCCTGCTGAGCCACCTGGCCGCCATGGACCTGTCCTCGGTGGTCTCCGAGGACGATCTGGTCACTAAGCTCAATGCTGGCCTGCAGGCTGTGTCTGAGGACCCCAGGCTCCAGGTGCAGGTGGTCAGACCCAAAGAAGCCTCTTCTGGGCCTGAGGAAGAAGCTGAAGAACCTCCAGAGGCGGTCCCGGAAGTGCCCGAGGACGAGGCTGTTCGGCGGGCTCTGGTGGACTCCGTGTTCCAGGTTTCTGTGCTGCCGGGCAACGTGGGCTACCTGCGCTTCGACAGTTTCGCTGATGCCTCTGTCCTGGAGGTGCTGGGCCCCTACATCCTGCACCAGGTGTGGGAGCCCCTGCAGGACACGGAGCACCTCATCATGGACCTGCGGCAGAACCCCGGGGGGCCGTCCTCCGCGGTGCCCCTGCTGCTCTCCTACTTCCAGAGCCCTGACGCCAGCCCCGTGCGCCTCTTCTCCACCTACGACCGGCGCACCAACATCACACGCGAGCACTTCAGCCAGACGGAGCTGCTGGGCCGGCCCTACGGCACCCAGCGTGGCGTGTACCTGCTCACTAGCCACCGCACCGCCACCGCGGCCGAGGAGCTGGCCTTCCTCATGCAGTCACTGGGCTGGGCCACGCTGGTGGGCGAGATCACCGCGGGCAGCCTGCTGCACACACACACAGTATCCCTGCTGGAGACGCCCGAGGGCGGCCTGGCGCTCACGGTGCCTGTGCTCACCTTCATCGACAACCATGGCGAGTGCTGGCTGGGGGGCGGTGTGGTCCCCGATGCCATTGTGCTGGCCGAGGAAGCCCTAGACAGAGCCCAGGAGGTGCTGGAGTTCCACCGAAGCTTGGGGGAGTTGGTGGAAGGCACGGGGCGCCTGCTGGAGGCCCACTACGCTCGGCCAGAGGTCGTGGGGCAGATGGGTGCCCTGCTGCGAGCCAAGCTGGCCCAGGGGGCCTACCGCACCGCGGTGGACCTGGAGTCGCTGGCTTCCCAGCTTACGGCCGACCTGCAGGAGATGTCTGGGGACCACCGTCTGCTGGTGTTCCACAGCCCCGGCGAAATGGTGGCTGAGGAGGCGCCCCCACCGCCTCCCGTCGTCCCCTCCCCGGAGGAGCTGTCCTATCTCATCGAGGCCCTGTTCAAGACTGAGGTGCTGCCCGGCCAGCTGGGCTACCTGCGTTTCGACGCCATGGCTGAGCTGGAGACGGTGAAGGCCGTCGGGCCACAGCTGGTGCAGCTGGTGTGGCAGAAGCTGGTGGACACGGCCGCGCTGGTGGTCGACCTGCGCTACAACCCCGGCAGCTACTCCACAGCCGTGCCTCTACTCTGCTCCTACTTCTTCGAGGCAGAGCCCCGCCGGCACCTCTACTCTGTCTTTGACAGGGCCACGTCAAGGGTCACAGAGGTATGGACCCTGCCCCACGTTACAGGCCAGCGCTATGGCTCCCACAAGGACCTCTACGTTCTGGTGAGCCACACCAGCGGTTCAGCAGCTGAGGCTTTTGCTCACACCATGCAGGATCTGCAGCGAGCCACCATCATCGGGGAGCCCACGGCCGGAGGGGCACTCTCCGTGGGAATCTACCAGGTGGGCAGCAGCGCCTTATACGCCTCCATGCCCACGCAGATGGCCATGAGTGCCAGCACCGGCGAGGCCTGGGATCTGGCTGGGGTGGAGCCGGACATCACTGTGCCCATGAGCGTGGCCCTCTCCACAGCCCGGGACATAGTGACCCTGCGTGCCAAGGTGCCCACTGTGCTGCAGACAGCTGGGAAGCTCGTAGCGGATAACTACGCCTCCCCTGAGCTGGGAGTCAAGATGGCAGCCGAACTGAGTGGTCTGCAGAGCCGCTATGCCAGGGTGACCTCAGAAGCCGCCCTCGCCGAGCTGCTGCAAGCCGACCTGCAGGTGCTGTCCGGGGACCCACACCTGAAGACAGCTCATATACCTGAGGATGCCAAAGACCGCATTCCTGGCATTGTACCCATGCAGTAACAG
ATGGACATGATGGACGGCTGCCAGTTCTCGCCCTCTGAGTACTTCTACGACGGCTCCTGCATCCCATCCCCCGACGGTGAGTTCGGGGACGAGTTTGAGCCGCGAGTGGCTGCTTTCGGGGCTCACAAGGCAGACCTGCAAGGCTCAGACGAGGACGAGCACGTGCGAGCACCCACGGGCCACCACCAGGCCGGCCACTGCCTCATGTGGGCCTGCAAAGCATGCAAAAGGAAGTCCACCACCATGGATCGGCGGAAGGCGGCCACCATGCGCGAGCGGAGACGCCTGAAGAAGGTCAACCAGGCTTTCGACACGCTCAAGCGGTGCACCACGACCAACCCTAACCAGAGGCTGCCCAAGGTGGAGATCCTCAGGAATGCCATCCGCTACATTGAGAGTCTGCAGGAGCTGCTTAGGGAACAGGTGGAAAACTACTATAGCCTGCCGGGGCAGAGCTGCTCTGAGCCCACCAGCCCCACCTCAAGTTGCTCTGATGGCATGTAAATG
Was it helpful?

Solution

The key to this is take what you had so far as a functional unit of what you were trying to build. Your code worked for one sequence; extending it to multiple sequences is simply a matter of getting rid of the assumption that only one sequence is being considered:

use strict;
use warnings;
use 5.010;

use Data::Dumper ();

sub count_dimers {
    my ($sequence) = @_;

    my %counts;
    $counts{substr($sequence, $_, 2)}++ for 0..length($sequence) - 2;

    my @dimers = qw(AA AG AC AT GA GG GC GT CC CG CA CT TT TA TG TC);
    %counts = map { $_ => $counts{$_} } @dimers;

    return %counts;
}

open(my $fh, '<', 'sample.txt');

my @counts_by_sequence;
while (my $sequence = <$fh>) {
    my %sequence_counts = count_dimers($sequence);
    push @counts_by_sequence, \%sequence_counts;
}

my %total_counts;
for my $sequence_counts (@counts_by_sequence) {
    for my $dimer (keys %$sequence_counts) {
        $total_counts{$dimer} += ${ $sequence_counts}{$dimer};
    }
}

say Data::Dumper->Dump(
    [\%total_counts, \@counts_by_sequence],
    [qw(total_count counts_by_sequence)]
);

I left the output as an exercise to the reader, but the general shape of the transformation should be evident: What was once the entirety of the program is now a function that gets called once per sequence with the results per call stored and the results over all calls totaled.

OTHER TIPS

#!/usr/bin/perl

use strict;

my @dimers = qw(AA AG AC AT GA GG GC GT CC CG CA CT TT TA TG TC);
my @dimers_totals;

my $i = 1;
for(<>)
{
    my $sequence = $_;
    print("Sequence $i:\n");
    my $j = 0;
    for(@dimers)
    {
        my $number =()= $sequence =~ /$_/gi;
        $dimers_totals[$j++] += $number;
        print "\t$_: $number\n"
    }
    print("\n");
    $i++;
}

print("Totals:\n");
$i = 0;
for(@dimers)
{
    print("\t$_: $dimers_totals[$i++]\n");
}

Use like

./dimers_count.pl < sample.txt

My first thought was a global match in scalar context with an adjustment to pos() to back up one. This way, I only have to scan the string once for all dimers (whereas the other answers so far scan the sequence once for every dimer). I maintain too hashes; one for the total and one for the sequence. I'm using the $. special variable that holds the current input line number to label the sequence:

use Data::Printer;

while( <DATA> ) {
    while( /\G([ATGC]{2})/g ) {
        $total{$1}++;
        $by_sequence{$.}{$1}++;
        pos() = pos() - 1
        }
    }

p( %total );
p( %by_sequence );

Doing the same thing with substr took a bit much more work because I couldn't limit the substring to just the characters that mattered. I had to pay attention to newlines and odd counts at the end:

use Data::Printer;

LINE: while( <DATA> ) {
    chomp;
    my $pos = 0;
    SUB: while( my $sub = substr( $_, $pos, 2 ) ) {
        last SUB unless 2 == length $sub;
        $total{$sub}++;
        $by_sequence{$.}{$sub}++;
        $pos++;
        }
    }

p( %total );
p( %by_sequence );

The output from Data::Printer is very nice. I've been using it in preference to Data::Dumper since I found out about it at YAPC::Brasil. (And, by the way, its author will be at YAPC::NA in Madison this year to talk about it):

{
    AA   101,
    AC   215,
    AG   268,
    AT   106,
    CA   286,
    CC   388,
    CG   201,
    CT   310,
    GA   239,
    GC   376,
    GG   369,
    GT   168,
    TA   61,
    TC   206,
    TG   317,
    TT   73
}
{
    1   {
        AA   9,
        AC   3,
        AG   8,
        AT   8,
        CA   11,
        CC   12,
        CG   3,
        CT   11,
        GA   5,
        GC   9,
        GG   8,
        GT   6,
        TA   2,
        TC   13,
        TG   10,
        TT   7
    },
    2   {
        AA   68,
        AC   177,
        AG   219,
        AT   80,
        CA   230,
        CC   329,
        CG   168,
        CT   264,
        GA   195,
        GC   316,
        GG   317,
        GT   146,
        TA   50,
        TC   169,
        TG   271,
        TT   54
    },
    3   {
        AA   24,
        AC   35,
        AG   41,
        AT   18,
        CA   45,
        CC   47,
        CG   30,
        CT   35,
        GA   39,
        GC   51,
        GG   44,
        GT   16,
        TA   9,
        TC   24,
        TG   36,
        TT   12
    }
}
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top