Question

I have my data of pairwise DNA sequences showing similarity in the following way..

AATGCTA|1   AATCGTA|2
AATCGTA|2   AATGGTA|3
AATGGTA|3   AATGGTT|8
TTTGGTA|4   ATTGGTA|5
ATTGGTA|5   CCTGGTA|9
CCCGGTA|6   GCCGGTA|7
GGCGGTA|10  AATCGTA|2
GGCGGTA|10  TGCGGTA|11
CAGGCA|12   GAGGCA|13

The above is a sample input file, the original file is few millions rows. I want output to be cluster the overlapping ids based on the common elements between the rows and output them to one single line for each cluster, as below

AATGCTA|1   AATCGTA|2   AATGGTA|3   AATGGTT|8   GGCGGTA|10  TGCGGTA|11
TTTGGTA|4   ATTGGTA|5   CCTGGTA|9
CCCGGTA|6   GCCGGTA|7
CAGGCA|12   GAGGCA|13 

I am currently trying to cluster them using mcl and also silix, I was not successful in running silix. But the mcl is currently in progress, I would like to know if there are any other ways smart ways of doing this may be in awk or perl. I appreciate some solution, thank you. (this is my first post I am sorry if I have made some mistake)

Just to make it simpler.. is it easy to say my input is,

1   2
2   3
3   8
4   5
5   9
6   7
10  2
10  11
12  13

and I want output to be,

1   2   3   8   10  11
4   5   9
6   7
12  13
Was it helpful?

Solution

I think this is not really it, but anyway:

use strict;
use warnings;
my @rows;
my %indx;
while(<DATA>) {
  chomp;
  my @v = split (/\s+/);
  my $r = {};
  for my $k (@v) {
    $r = $indx{$k}[0] if defined $indx{$k};
  }
  $r->{$v[0]}++;
  $r->{$v[1]}++;
  # print join(",", @v), "\n";
  push(@{$indx{$v[0]}}, $r);
  push(@{$indx{$v[1]}}, $r);
  push(@rows,  $r);
}
my %seen;
for my $r (@rows) {
  print (join("\t", keys %$r), "\n") if not $seen{$r}++;
}

__DATA__
AATGCTA|1   AATCGTA|2
AATCGTA|2   AATGGTA|3
AATGGTA|3   AATGGTT|8
TTTGGTA|4   ATTGGTA|5
ATTGGTA|5   CCTGGTA|9
CCCGGTA|6   GCCGGTA|7
GGCGGTA|10  AATCGTA|2
GGCGGTA|10  TGCGGTA|11
CAGGCA|12   GAGGCA|13

Output:

GGCGGTA|10  AATGCTA|1   AATGGTT|8   AATCGTA|2   AATGGTA|3   TGCGGTA|11
CCTGGTA|9   TTTGGTA|4   ATTGGTA|5
CCCGGTA|6   GCCGGTA|7
CAGGCA|12   GAGGCA|13

OTHER TIPS

as you wished, here comes awk solution:

awk 'BEGIN{f=1}{c=0;
        for(i=1;i<=f;i++){
                if(!a[i]){
                        a[i]=$1" "$2; c=1; break;
                }else if(a[i]~$1){
                        a[i]=a[i]" "$2; c=1; break;
                }else if(a[i]~$2){ a[i]=a[i]" "$1; c=1; break; }
        }
        if(!c){ a[++f]=$1" "$2; c=0; }
} END{for(x=1;x<=f;x++)print a[x]}' DnaFile

the codes above was tested with your simpler input file and original file(with CCGGTTAA etc.) both worked. Output omitted.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top