Question

I have to find simple sequence repeats and store each unique repeat along with its position. I have already written Perl code to do that (it uses tons of if and for statements to find repeats up to pentamers). My question is: is there a simpler way to do this in Java, such as a regex that searches the string and returns the count of consecutive repeats and their positions, without so many control statements and iterations?

Update: A simple sequence repeat (SSR) is simply an uninterrupted repeated string of characters, starting from a dimer (i.e. two different characters repeating together). It is like a word being repeated continuously in a sentence without a break. In the case of DNA, it would look like this:

AATTAAAATTTTAAAAAAAAGGGCCCTTTAA[ATATATATATATAT]AAGGGATTTAAGGAATTAAGA[TGATGATGATGATGA]TGGTAG

Here AT and TGA are simple sequence repeats: AT is a dimer and TGA is a trimer. What I have to find is the start position of each sequence repeat, how many times it is repeated (i.e. the length), and which sequence it is (for example, AT starts at position 6 and is repeated 10 times, TGA starts at position 25, and so on).

My Perl code (it is kind of buggy):

my $i=0;
my $j=0;
my $dna;
my $m2=0;
my $m4=0;
my $m3=0;
my $m5=0;
my $temp1;
my $temp2;
my $min;

print "Please enter DNA sequence : ";
$dna=<>;
my $firstdi;
my $seconddi;
my $thirddi;
my $fourthdi;
my $fifthdi;
my $firsttet;
my $secondtet;
my $thirdtet;
my $fourthtet;
my $fifthtet;
my $firsttri;
my $secondtri;
my $thirdtri;
my $fourttri;
my $fifthtri;
my $firstpent;
my $secondpent;
my $thirdpent;
my $fourtpent;
my $fifthpent;
print "\n";
print "Please enter the length to search for : ";
$motif=<>;
print "\n";
print "Please enter the minimum number of motif repeats : " ;
$min=<>;
chomp($dna);
chomp($motif);
chomp($min);
my @output;
my @codearr = split //, $dna;
print "\n";print @codearr;print "\n";
my $arrsize=@codearr;
print "\nSize : ";
print "\n";print $arrsize; print "\n";
print "Output : ";
my $total=0;


if($motif==2)
{
for($i=0;$i<($arrsize-2);$i=$i+$motif)
{
    if($codearr[$i] ne $codearr[$i+1])
    {
  $temp1 = join( "",$codearr[$i],$codearr[$i+1]);
  $temp2 = join( "",$codearr[$i+2],$codearr[$i+3]);
   if($temp1 eq $temp2)
 {
 if($m2==0)
 {
 $ms1=$i;
 }
 $total++;
 $m2++;
 }
}
}
}



if($motif==3)
{
for($i=0;$i<($arrsize-2);$i=$i+$motif)
{if($codearr[$i] ne $codearr[$i+1])
    {
  $temp1 = join( "",$codearr[$i],$codearr[$i+1]);
  $temp2 = join( "",$codearr[$i+2],$codearr[$i+3]);
   if($temp1 eq $temp2)
 {
  if($m2==0)
 {
 $ms1=$i;
 }
 $m2++;
  $total++;
 }
}
}

for($i=0;$i<($arrsize-3);$i=$i+$motif)
{if($codearr[$i] ne $codearr[$i+1])
    {
$temp1 = join( "",$codearr[$i],$codearr[$i+1],$codearr[$i+2]);
$temp2 = join( "",$codearr[$i+3],$codearr[$i+4],$codearr[$i+5]);
 if($temp1 eq $temp2)
 {
  if($m3==0)
 {
 $ms3=$i;
 }
 $m3++;
  $total++;
 }
}
}
}



if($motif==4)
{
for($i=0;$i<($arrsize-2);$i=$i+$motif)
{
    if($codearr[$i] ne $codearr[$i+1])
    {
  $temp1 = join( "",$codearr[$i],$codearr[$i+1]);
  $temp2 = join( "",$codearr[$i+2],$codearr[$i+3]);
   if($temp1 eq $temp2)
 {
  if($m2==0)
 {
 $ms1=$i;
 }
 $m2++;
  $total++;
 }
}
}
for($i=0;$i<($arrsize-3);$i=$i+$motif)
{
    if($codearr[$i] ne $codearr[$i+1])
    {
$temp1 = join( "",$codearr[$i],$codearr[$i+1],$codearr[$i+2]);
$temp2 = join( "",$codearr[$i+3],$codearr[$i+4],$codearr[$i+5]);
 if($temp1 eq $temp2)
 {
  if($m3==0)
 {
 $ms3=$i;
 }
 $m3++;
  $total++;
 }
}
}
for($i=0;$i<($arrsize-4);$i=$i+$motif)
{
    if($codearr[$i] ne $codearr[$i+1])
    {
$temp1 = join( "",$codearr[$i],$codearr[$i+1],$codearr[$i+2],$codearr[$i+3]);
$temp2 = join( "",$codearr[$i+4],$codearr[$i+5],$codearr[$i+6],$codearr[$i+7]);
 if($temp1 eq $temp2)
 {
  if($m4==0)
 {
 $ms4=$i;
 }
 $m4++;
  $total++;
 }
}
}
}



if($motif==5)
{
for($i=0;$i<($arrsize-2);$i=$i+$motif)
{if($codearr[$i] ne $codearr[$i+1])
    {
  $temp1 = join( "",$codearr[$i],$codearr[$i+1]);
  $temp2 = join( "",$codearr[$i+2],$codearr[$i+3]);
   if($temp1 eq $temp2)
 {
  if($m2==0)
 {
 $ms1=$i;
 }
  $total++;
 $m2++;
 }
}  }
for($i=0;$i<($arrsize-3);$i=$i+$motif)
{
    if($codearr[$i] ne $codearr[$i+1])
    {
$temp1 = join( "",$codearr[$i],$codearr[$i+1],$codearr[$i+2]);
$temp2 = join( "",$codearr[$i+3],$codearr[$i+4],$codearr[$i+5]);
 if($temp1 eq $temp2)
 {
  if($m3==0)
 {
 $ms3=$i;
 }
  $total++;
 $m3++;
 }
}
}
for($i=0;$i<($arrsize-4);$i=$i+$motif)
{
    if($codearr[$i] ne $codearr[$i+1])
    {
$temp1 = join( "",$codearr[$i],$codearr[$i+1],$codearr[$i+2],$codearr[$i+3]);
$temp2 = join( "",$codearr[$i+4],$codearr[$i+5],$codearr[$i+6],$codearr[$i+7]);
 if($temp1 eq $temp2)
 {
  if($m4==0)
 {
 $ms4=$i;
 }
  $total++;
 $m4++;
 }
}
}
for($i=0;$i<($arrsize-5);$i=$i+$motif)
{if($codearr[$i] ne $codearr[$i+1])
    {
 $temp1 = join( "",$codearr[$i],$codearr[$i+1],$codearr[$i+2],$codearr[$i+3],$codearr[$i+4]);
 $temp2 = join( "",$codearr[$i+5],$codearr[$i+6],$codearr[$i+7],$codearr[$i+8],$codearr[$i+9]);
 if($temp1 eq $temp2)
 {
  if($m5==0)
 {
 $ms5=$i;
 }
  $total++;
 $m5++;
 }
}
}
}


if($motif==2)
{
  if($min<$total)
  {
print"Number of Dimer repeats : ";
    print $m2;
    print"\n";
    print"First position : ";
    print $ms1;
    print "\n";
    print "Sequence Lenght : ";
    print $arrsize;
    print "\n";
}
else
  {
  print "No or less than minimum SSRs found";}
}

if($motif==3)
{
  if($min<$total)
  {
print"Number of Dimer repeats : ";
    print $m2;
    print"\n";
    print"First position : ";
    print $ms1;
    print "\n";
    print "Sequence Lenght : ";
    print $arrsize;
    print "\n";
print"Number of Trimer repeats : ";
    print $m3;
    print"\n";
    print"First position : ";
    print $ms3;
    print "\n";
    print "Sequence Lenght : ";
    print $arrsize;
    print "\n";
}
else
  {
  print "No or less than minimum SSRs found";}
}

if($motif==4)
{
  if($min<$total)
  {
print"Number of Dimer repeats : ";
    print $m2;
    print"\n";
    print"First position : ";
    print $ms1;
    print "\n";
    print "Sequence Lenght : ";
    print $arrsize;
    print "\n";
print"Number of Trimer repeats : ";
    print $m3;
    print"\n";
    print"First position : ";
    print $ms3;
    print "\n";
    print "Sequence Lenght : ";
    print $arrsize;
    print "\n";
print"Number of Tetramer repeats : ";
    print $m4;
    print"\n";
    print"First position : ";
    print $ms4;
    print "\n";
    print "Sequence Lenght : ";
    print $arrsize;
    print "\n";
}
else
  {
  print "No or less than minimum SSRs found";}
}


if($motif==5)
{
  if($min<$total)
    {
print"Number of Dimer repeats : ";
    print $m2;
    print"\n";
    print"First position : ";
    print $ms1;
    print "\n";
    print "Sequence Lenght : ";
    print $arrsize;
    print "\n";
print"Number of Trimer repeats : ";
    print $m3;
    print"\n";
    print"First position : ";
    print $ms3;
    print "\n";
    print "Sequence Lenght : ";
    print $arrsize;
    print "\n";
print"Number of Tetramer repeats : ";
    print $m4;
    print"\n";
    print"First position : ";
    print $ms4;
    print "\n";
    print "Sequence Lenght : ";
    print $arrsize;
    print "\n";
print"Number of Pentamer repeats : ";
    print $m5;
    print"\n";
    print"First position : ";
    print $ms5;
    print "\n";
    print "Sequence Lenght : ";
    print $arrsize;
    print "\n";
  }
  else
  {
  print "No or less than minimum SSRs found";}
}

Solution

Something like the following regex should help you get started. It applies to the "dimers" mentioned in your question and can be extended to find longer repeats.

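import java.util.regex.Matcher;
import java.util.regex.Pattern;

String s = "AATTAAAATTTTAAAAAAAAGGGCCCTTTAAATATATATATATATAAGGGATTTAAGGAATTAAGATGATGATGATGATGATGGTAG";
// Group 1 captures any two bases; \\1+ requires that same pair at least once more.
Pattern pattern = Pattern.compile("([ATGC][ATGC])\\1+");
Matcher matcher = pattern.matcher(s);

// find() moves to the next non-overlapping match on each call.
while (matcher.find()) {
    System.out.print("Start index: " + matcher.start());
    System.out.print(" End index: " + matcher.end());
    System.out.println(" Found: " + matcher.group());
}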

This produces the following output (start indices are 0-based, end indices are exclusive):

Start index: 4 End index: 8 Found: AAAA
Start index: 8 End index: 12 Found: TTTT
Start index: 12 End index: 20 Found: AAAAAAAA
Start index: 31 End index: 45 Found: ATATATATATATAT
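The question also asks for the number of repeats. As a minimal sketch, assuming the same dimer regex as above: group 1 holds the two-character motif, so the repeat count is the match length divided by the motif length. The class name DimerRepeatCount and the method describe are made up for this illustration and are not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
class DimerRepeatCount {
    // Same pattern as in the answer: any two bases followed by one or more copies.
    private static final Pattern DIMER = Pattern.compile("([ATGC][ATGC])\\1+");

    // Returns one line per match: motif, 0-based start index, number of repeats.
    static List<String> describe(String dna) {
        List<String> lines = new ArrayList<>();
        Matcher m = DIMER.matcher(dna);
        while (m.find()) {
            String motif = m.group(1);
            int repeats = (m.end() - m.start()) / motif.length();
            lines.add(motif + " at " + m.start() + " x" + repeats);
        }
        return lines;
    }

    public static void main(String[] args) {
        String s = "AATTAAAATTTTAAAAAAAAGGGCCCTTTAAATATATATATATATAAGGGATTTAAGGAATTAAGATGATGATGATGATGATGGTAG";
        describe(s).forEach(System.out::println);
    }
}
```

For the sequence above, the 14-character match at index 31 would be reported as AT at 31 x7.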

A lexing library might also help, since it lets you build state machines out of regexes like these to identify more complex patterns. Take a look at JLex.


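Edit: You said AA doesn't count as a pattern and that the two characters have to be different. You can try this regex instead:

Pattern pattern = Pattern.compile("([ATGC])(?!\\1)([ATGC])(?:\\1\\2)+");

The negative lookahead (?!\\1) ensures that the second character differs from the first, so runs such as AAAA are no longer matched. The pair is captured once and then repeated with (?:\\1\\2)+, so a run with an odd number of repeats, such as the seven ATs in the example, is matched in full.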
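To cover trimers up to pentamers and report the repeat count as well, the same idea can be parameterised by motif length. The following is a minimal sketch and not part of the original answer: the class SsrFinder, the holder class Ssr and the minRepeats parameter are names assumed for illustration. It builds one regex per motif length k and skips motifs that are themselves repeats of a shorter unit (AA, AAA, ATAT and so on), so that a dimer run is not reported again as a tetramer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; all names here are illustrative.
class SsrFinder {
    // One repeat: the motif, its 0-based start index, and the number of copies.
    static final class Ssr {
        final String motif;
        final int start;
        final int repeats;

        Ssr(String motif, int start, int repeats) {
            this.motif = motif;
            this.start = start;
            this.repeats = repeats;
        }

        @Override
        public String toString() {
            return motif + " x" + repeats + " at " + start;
        }
    }

    // Matches strings that consist of a shorter unit repeated, e.g. AA or ATAT.
    private static final Pattern NON_PRIMITIVE = Pattern.compile("(.+)\\1+");

    static List<Ssr> find(String dna, int minLen, int maxLen, int minRepeats) {
        if (minLen < 1 || minRepeats < 2) {
            throw new IllegalArgumentException("need minLen >= 1 and minRepeats >= 2");
        }
        List<Ssr> found = new ArrayList<>();
        for (int k = minLen; k <= maxLen; k++) {
            // k bases, then at least (minRepeats - 1) further copies of them.
            Pattern p = Pattern.compile("([ACGT]{" + k + "})\\1{" + (minRepeats - 1) + ",}");
            Matcher m = p.matcher(dna);
            int from = 0;
            while (m.find(from)) {
                String motif = m.group(1);
                if (NON_PRIMITIVE.matcher(motif).matches()) {
                    // Rejected motif: retry one position later so that a valid
                    // repeat overlapping this match is not skipped.
                    from = m.start() + 1;
                } else {
                    found.add(new Ssr(motif, m.start(), (m.end() - m.start()) / k));
                    from = m.end();
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String s = "AATTAAAATTTTAAAAAAAAGGGCCCTTTAAATATATATATATATAAGGGATTTAAGGAATTAAGATGATGATGATGATGATGGTAG";
        // Dimers to pentamers, at least two copies each.
        find(s, 2, 5, 2).forEach(System.out::println);
    }
}
```

A regex reports a run at its leftmost phase: GATGATGATGA is reported with motif GAT, not ATG or TGA. In the sequence from the question the bracketed TGA run is preceded by GA, so this sketch would report it as GAT starting two positions earlier. If a canonical motif is needed, the reported motif can be rotated afterwards.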

OTHER TIPS

This looks a lot like Lempel-Ziv (LZ) compression. Take a look, it might help you:

A high level view of the encoding algorithm is shown here:

1. Initialize the dictionary to contain all strings of length one.
2. Find the longest string W in the dictionary that matches the current input.
3. Emit the dictionary index for W to output and remove W from the input.
4. Add W followed by the next symbol in the input to the dictionary.
5. Go to Step 2.

http://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Welch
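The quoted encoding steps can be sketched in Java roughly as follows. This is only an illustration of the dictionary-building loop, under the assumption that the alphabet is passed in as a string of distinct characters. The class name LzwSketch and its encode method are made up here, and the sketch does not by itself report repeat positions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the LZW encoding loop; names are illustrative.
class LzwSketch {
    static List<Integer> encode(String input, String alphabet) {
        // Step 1: the dictionary starts with every single-character string.
        Map<String, Integer> dict = new HashMap<>();
        for (char c : alphabet.toCharArray()) {
            dict.put(String.valueOf(c), dict.size());
        }

        List<Integer> out = new ArrayList<>();
        String w = "";
        for (char c : input.toCharArray()) {
            if (!dict.containsKey(String.valueOf(c))) {
                throw new IllegalArgumentException("symbol not in alphabet: " + c);
            }
            String wc = w + c;
            if (dict.containsKey(wc)) {
                // Step 2: keep extending W while it is still in the dictionary.
                w = wc;
            } else {
                // Step 3: emit the index of the longest match W.
                out.add(dict.get(w));
                // Step 4: add W plus the next symbol as a new entry.
                dict.put(wc, dict.size());
                // Step 5: continue matching from the current symbol.
                w = String.valueOf(c);
            }
        }
        if (!w.isEmpty()) {
            out.add(dict.get(w)); // flush the last pending match
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(encode("ATATATAT", "ACGT")); // prints [0, 3, 4, 6, 3]
    }
}
```

On a repetitive input the dictionary entries grow along the repeat (AT, then ATA, then ATAT), which is the resemblance to SSR detection that this tip points at.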

You could also use a modified Trie, where the word indication would also store the number of repeats.

http://en.wikipedia.org/wiki/Trie
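One possible reading of the trie suggestion, sketched under assumptions: each motif is a path in the trie, and the node that ends the motif stores every (start, repeats) pair found for it. The class MotifTrie and its methods add and lookup are hypothetical names, and finding the repeats themselves is left to whatever scanner feeds the trie.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a trie keyed by motif, storing where each motif repeats.
class MotifTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        // Non-empty only on nodes that end a stored motif; each entry is {start, repeats}.
        final List<int[]> hits = new ArrayList<>();
    }

    private final Node root = new Node();

    // Records that the given motif repeats the given number of times from start.
    void add(String motif, int start, int repeats) {
        Node node = root;
        for (char c : motif.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.hits.add(new int[] {start, repeats});
    }

    // Returns all recorded {start, repeats} pairs for the motif, or an empty list.
    List<int[]> lookup(String motif) {
        Node node = root;
        for (char c : motif.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return new ArrayList<>();
            }
        }
        return node.hits;
    }

    public static void main(String[] args) {
        MotifTrie trie = new MotifTrie();
        trie.add("AT", 31, 7);
        trie.add("TGA", 66, 5);
        for (int[] hit : trie.lookup("AT")) {
            System.out.println("AT starts at " + hit[0] + ", repeats " + hit[1] + " times");
        }
    }
}
```

A plain HashMap from motif to a list of hits would store the same information. The trie only pays off if prefix queries are needed, such as listing all motifs that start with AT.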

Licensed under: CC-BY-SA with attribution