Question

I want to create a hash key in perl hash key that looks like this (lowerR-10,UpperR-12) => 1. Here the key is (lowerR-10,UpperR-12) and its value is 1.

Actually I have a file like this. I have to find the overlap among the the elements.

A 10 12

A 10 15

Whose output will be

A 10 12 2

A 12 15 1

The last column shows the overlap among the elements. I would like to save the count in a hash for which I think the key should be like (lowerR-10,UpperR-12) this. If anyone can give some new suggestion regarding how to save the key it will be great too.

Thanks

Was it helpful?

Solution

Maybe the program below will get you close to a solution.

#!/usr/bin/perl
use strict;
use warnings;
use Set::IntSpan;
use Sort::Naturally;

my %data;

while (<DATA>) {
    my ($chr, @start_stop) = split; 
    $data{$chr}{$_}++ for $start_stop[0] .. $start_stop[1];
}

for my $chr (nsort keys %data) {
    my %counts;
    while (my ($range_int, $count) = each %{ $data{$chr} } ) {
        push @{ $counts{$count} }, $range_int;  
    }

    for my $count (sort {$a <=> $b} keys %counts) {
        my $set = Set::IntSpan->new(@{$counts{$count}});
        for my $run ($set->sets) {
            printf "%s %-10s count: %s\n", $chr, $run, $count;
        }
    }
    print "\n";
}

__DATA__
chr1    100    500
chr1    25      50
chr1    10       50
chr1    60       80
chr1    12       40
chr1    41       45
chr1     20      45
chr1     48      80
chr1    4   60
chr2    2   40
chr3    4   90
chr1    5   40
chr2    1   30
chr1    6   20
chr4    9   100
chr1    2   20
chr2    2   90
chr1    6   20
chr4    4   30
chr2    4   90
chr3    3   90
chr2    4   90
chr4    3   90
chr2    4   30

It produced this output.

chr1 2-3        count: 1
chr1 100-500    count: 1
chr1 4          count: 2
chr1 51-59      count: 2
chr1 61-80      count: 2
chr1 5          count: 3
chr1 46-47      count: 3
chr1 60         count: 3
chr1 48-50      count: 4
chr1 6-9        count: 5
chr1 21-24      count: 5
chr1 41-45      count: 5
chr1 10-11      count: 6
chr1 25-40      count: 6
chr1 12-19      count: 7
chr1 20         count: 8

chr2 1          count: 1
chr2 2-3        count: 3
chr2 41-90      count: 3
chr2 31-40      count: 4
chr2 4-30       count: 6

chr3 3          count: 1
chr3 4-90       count: 2

chr4 3          count: 1
chr4 91-100     count: 1
chr4 4-8        count: 2
chr4 31-90      count: 2
chr4 9-30       count: 3

Update: I'll try to explain. The %counts hash is created anew for each chromosome from the outer loop. The keys are the counts of each numbered position, say number 42 was seen 5 times. The value for each count is an anonymous array that has all numbers that were seen 5 times.

Set::IntSpan is used to create ranges, (6-8, 21-24, 41-45), from the anonymous array, (which has 6,7,8,21,22,23,24,41,42,43,44,45 as elements in the anon array). The line for my $run ($set->sets), gets each run list for numbers seen 5 times, (6-8, 21-24, 41-45) and then prints them. You can look at the documentation for Set::IntSpan although it doesn't provide many helpful examples, and I've not been able to find any other good examples by net search, sorry. But basically, you feed Set::IntSpan ranges of numbers and it can give you the condensed subsets, (6-8, 21-24, etc), or each individual number in a set depending on the Set::IntSpan method you use to access the data held by a IntSpan object.

Hope this clears up some questions you had. :-)

OTHER TIPS

Actually I don't think a hash-based approach lends itself well to this problem; it is better to approach it as a list processing problem. I'll describe an approach in Haskell (this could be translated to Perl with only slightly more code).

The general approach is to keep a running tally of how many elements are overlapping at any given time. To do this, we'll build a list of "deltas" - changes in the running total - from the argument list of elements as (start, end) pairs. Then sort these deltas by where they occur, and combine ones that occur at the same time. Finally, keep a running sum of the deltas, which is equal to the number of overlapping elements at the start of each period (each period ends where the next begins).

Complete code:

import Data.List      -- sortBy
import Data.Ord       -- comparing
import Data.Function  -- `on`

withDelta delta x = (x, delta)

episodeToDelta elements = increases ++ decreases
  where increases = map (withDelta 1    . fst) elements
        decreases = map (withDelta (-1) . snd) elements

compactDelta::[(Integer,Integer)]->(Integer,Integer)
compactDelta (x:xs) = ((fst x),(sum $ map snd (x:xs)))

adjustTotal (t, delta) ((t1, total1):rest) = (t, total1 + delta):(t1,total1):rest

deltasToRunningTotal = reverse . foldr adjustTotal [(0,0)] . reverse

noZeros = filter (\(_, d) -> d /= 0)

elementsToRunningTotal = deltasToRunningTotal . noZeros . map compactDelta . groupBy ((==) `on` fst) . sortBy (comparing fst) . elementToDelta

sample_elements = [(10,15),(5,9),(13,22),(15,19),(14,16),(3,8),(2,12),(20,22),(23,29)]

> sortBy (comparing fst) sample_elements
[(2,12),(3,8),(5,9),(10,15),(13,22),(14,16),(15,19),(20,22),(23,29)]

> episodesToRunningTotal sample_elements
[(0,0),(2,1),(3,2),(5,3),(8,2),(9,1),(10,2),(12,1),(13,2),(14,3),(16,2),(19,1),(20,2),(22,0),(23,1),(29,0)]

I thought that showing what the hashes, (%data, %counts for each chromosome), might show better what each hash looks like. I use Data::Dumper often when I'm not sure how a hash is structured. It is a very useful module. If you run the program (below) it should better show the hash contents.

You should probably print the output in a file as there is too much the view on (my) terminal - like: perl your_name_program.pl > junk.txt.

Then open junk.txt to see the results.

#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
use Sort::Naturally qw/ nsort /;

my %data;

while (<DATA>) {
    my ($chr, @start_stop) = split; 
    $data{$chr}{$_}++ for $start_stop[0] .. $start_stop[1];
}

print "Printing \%data hash as follows\n";
print Dumper \%data;
print "\n";

for my $chr (nsort keys %data) {
    my %counts;
    while (my ($range_int, $count) = each %{ $data{$chr} } ) {
        push @{ $counts{$count} }, $range_int;  
    }
    print "Printing \%counts hash for chromosome $chr\n";
    print Dumper \%counts;
    print "\n";
}

__DATA__
chr1    100    500
chr1    25      50
chr1    10       50
chr1    60       80
chr1    12       40
chr1    41       45
chr1     20      45
chr1     48      80
chr1    4   60
chr2    2   40
chr3    4   90
chr1    5   40
chr2    1   30
chr1    6   20
chr4    9   100
chr1    2   20
chr2    2   90
chr1    6   20
chr4    4   30
chr2    4   90
chr3    3   90
chr2    4   90
chr4    3   90
chr2    4   30

(I'm not too clear on how to translate the Haskell solution to Perl at this point, but will give a closer examination later.)

Hope this gives a clearer picture.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top