Can I create a hash key in perl like this (lowerR-10,UpperR-12) => 1
Question
I want to create a hash key in perl hash key that looks like this (lowerR-10,UpperR-12) => 1. Here the key is (lowerR-10,UpperR-12) and its value is 1.
Actually I have a file like this. I have to find the overlap among the the elements.
A 10 12
A 10 15
Whose output will be
A 10 12 2
A 12 15 1
The last column shows the overlap among the elements. I would like to save the count in a hash for which I think the key should be like (lowerR-10,UpperR-12) this. If anyone can give some new suggestion regarding how to save the key it will be great too.
Thanks
Solution
Maybe the program below will get you close to a solution.
#!/usr/bin/perl
use strict;
use warnings;
use Set::IntSpan;
use Sort::Naturally;
my %data;
while (<DATA>) {
my ($chr, @start_stop) = split;
$data{$chr}{$_}++ for $start_stop[0] .. $start_stop[1];
}
for my $chr (nsort keys %data) {
my %counts;
while (my ($range_int, $count) = each %{ $data{$chr} } ) {
push @{ $counts{$count} }, $range_int;
}
for my $count (sort {$a <=> $b} keys %counts) {
my $set = Set::IntSpan->new(@{$counts{$count}});
for my $run ($set->sets) {
printf "%s %-10s count: %s\n", $chr, $run, $count;
}
}
print "\n";
}
__DATA__
chr1 100 500
chr1 25 50
chr1 10 50
chr1 60 80
chr1 12 40
chr1 41 45
chr1 20 45
chr1 48 80
chr1 4 60
chr2 2 40
chr3 4 90
chr1 5 40
chr2 1 30
chr1 6 20
chr4 9 100
chr1 2 20
chr2 2 90
chr1 6 20
chr4 4 30
chr2 4 90
chr3 3 90
chr2 4 90
chr4 3 90
chr2 4 30
It produced this output.
chr1 2-3 count: 1
chr1 100-500 count: 1
chr1 4 count: 2
chr1 51-59 count: 2
chr1 61-80 count: 2
chr1 5 count: 3
chr1 46-47 count: 3
chr1 60 count: 3
chr1 48-50 count: 4
chr1 6-9 count: 5
chr1 21-24 count: 5
chr1 41-45 count: 5
chr1 10-11 count: 6
chr1 25-40 count: 6
chr1 12-19 count: 7
chr1 20 count: 8
chr2 1 count: 1
chr2 2-3 count: 3
chr2 41-90 count: 3
chr2 31-40 count: 4
chr2 4-30 count: 6
chr3 3 count: 1
chr3 4-90 count: 2
chr4 3 count: 1
chr4 91-100 count: 1
chr4 4-8 count: 2
chr4 31-90 count: 2
chr4 9-30 count: 3
Update: I'll try to explain. The %counts
hash is created anew for each chromosome from the outer loop. The keys are the counts of each numbered position, say number 42 was seen 5 times. The value for each count is an anonymous array that has all numbers that were seen 5 times.
Set::IntSpan is used to create ranges, (6-8, 21-24, 41-45), from the anonymous array, (which has 6,7,8,21,22,23,24,41,42,43,44,45 as elements in the anon array). The line for my $run ($set->sets)
, gets each run list for numbers seen 5 times, (6-8, 21-24, 41-45) and then prints them. You can look at the documentation for Set::IntSpan although it doesn't provide many helpful examples, and I've not been able to find any other good examples by net search, sorry. But basically, you feed Set::IntSpan ranges of numbers and it can give you the condensed subsets, (6-8, 21-24, etc), or each individual number in a set depending on the Set::IntSpan method you use to access the data held by a IntSpan object.
Hope this clears up some questions you had. :-)
OTHER TIPS
Actually I don't think a hash-based approach lends itself well to this problem; it is better to approach it as a list processing problem. I'll describe an approach in Haskell (this could be translated to Perl with only slightly more code).
The general approach is to keep a running tally of how many elements are overlapping at any given time. To do this, we'll build a list of "deltas" - changes in the running total - from the argument list of elements as (start, end) pairs. Then sort these deltas by where they occur, and combine ones that occur at the same time. Finally, keep a running sum of the deltas, which is equal to the number of overlapping elements at the start of each period (each period ends where the next begins).
Complete code:
import Data.List -- sortBy
import Data.Ord -- comparing
import Data.Function -- `on`
withDelta delta x = (x, delta)
episodeToDelta elements = increases ++ decreases
where increases = map (withDelta 1 . fst) elements
decreases = map (withDelta (-1) . snd) elements
compactDelta::[(Integer,Integer)]->(Integer,Integer)
compactDelta (x:xs) = ((fst x),(sum $ map snd (x:xs)))
adjustTotal (t, delta) ((t1, total1):rest) = (t, total1 + delta):(t1,total1):rest
deltasToRunningTotal = reverse . foldr adjustTotal [(0,0)] . reverse
noZeros = filter (\(_, d) -> d /= 0)
elementsToRunningTotal = deltasToRunningTotal . noZeros . map compactDelta . groupBy ((==) `on` fst) . sortBy (comparing fst) . elementToDelta
sample_elements = [(10,15),(5,9),(13,22),(15,19),(14,16),(3,8),(2,12),(20,22),(23,29)]
> sortBy (comparing fst) sample_elements
[(2,12),(3,8),(5,9),(10,15),(13,22),(14,16),(15,19),(20,22),(23,29)]
> episodesToRunningTotal sample_elements
[(0,0),(2,1),(3,2),(5,3),(8,2),(9,1),(10,2),(12,1),(13,2),(14,3),(16,2),(19,1),(20,2),(22,0),(23,1),(29,0)]
I thought that showing what the hashes, (%data, %counts for each chromosome), might show better what each hash looks like. I use Data::Dumper often when I'm not sure how a hash is structured. It is a very useful module. If you run the program (below) it should better show the hash contents.
You should probably print the output in a file as there is too much the view on (my) terminal - like: perl your_name_program.pl > junk.txt
.
Then open junk.txt to see the results.
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
use Sort::Naturally qw/ nsort /;
my %data;
while (<DATA>) {
my ($chr, @start_stop) = split;
$data{$chr}{$_}++ for $start_stop[0] .. $start_stop[1];
}
print "Printing \%data hash as follows\n";
print Dumper \%data;
print "\n";
for my $chr (nsort keys %data) {
my %counts;
while (my ($range_int, $count) = each %{ $data{$chr} } ) {
push @{ $counts{$count} }, $range_int;
}
print "Printing \%counts hash for chromosome $chr\n";
print Dumper \%counts;
print "\n";
}
__DATA__
chr1 100 500
chr1 25 50
chr1 10 50
chr1 60 80
chr1 12 40
chr1 41 45
chr1 20 45
chr1 48 80
chr1 4 60
chr2 2 40
chr3 4 90
chr1 5 40
chr2 1 30
chr1 6 20
chr4 9 100
chr1 2 20
chr2 2 90
chr1 6 20
chr4 4 30
chr2 4 90
chr3 3 90
chr2 4 90
chr4 3 90
chr2 4 30
(I'm not too clear on how to translate the Haskell solution to Perl at this point, but will give a closer examination later.)
Hope this gives a clearer picture.