Question

So I have a sequence of nucleotides and I need to count the number of times the word gaga appears in the sequence. This is what I have so far:

dna=c("a","g","c","t")
N=16
x=sample(dna,N,replace=TRUE)
x2=paste(x,collapse="")
x2

Here is an example output:

gtaggcctaattataa

Eventually, I am going to write a loop to run this 100 times and plot a histogram of the counts of the word "gaga". So my main question is: how can I write a function or code that searches through the string x2 and counts the number of occurrences of the word "gaga"?

Any help would be appreciated! Thank you!


Solution

?regex
sapply( gregexpr( "gaga", c("gtaggcctaattataa", 
                            "gtaggcctaatgagaataa", 
                            "gagagaga") ) ,
        function(x) if( x[1]==-1 ){ 0 }else{ length(x) } )
# [1] 0 1 2
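
To get from these counts to the histogram mentioned in the question, the same `gregexpr` idea can be wrapped in a small function and run repeatedly. The following is only a minimal sketch and not part of the original answer: the names `count_word` and `simulate_count`, the seed, and the default sequence length of 16 are assumptions chosen for illustration. Like the code above, it counts non-overlapping matches.

```r
# Count non-overlapping occurrences of `word` in each element of `s`
# (hypothetical helper name; same gregexpr logic as above).
count_word <- function(s, word) {
  hits <- gregexpr(word, s, fixed = TRUE)
  vapply(hits, function(h) if (h[1] == -1L) 0L else length(h), integer(1))
}

# Draw one random sequence of length n and count the word in it.
simulate_count <- function(n = 16, word = "gaga") {
  dna <- c("a", "g", "c", "t")
  s <- paste(sample(dna, n, replace = TRUE), collapse = "")
  count_word(s, word)
}

set.seed(1)  # arbitrary seed, only so the run is reproducible
counts <- replicate(100, simulate_count())

# One bar per integer count
hist(counts, breaks = seq(-0.5, max(counts) + 0.5, by = 1),
     main = "Occurrences of \"gaga\" in 100 random sequences",
     xlab = "count")
```

With sequences this short most counts will be zero, so a larger `n` may give a more informative histogram.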

Other tips

This is actually a wrapper for DWin's solution found in the qdap package:

x<- c("gtaggcctaattataa", "gtaggcctaatgagaataa", "gagagaga")

library(qdap)
qdap:::termco.h(x, "gaga", seq_along(x))

##   3 word.count term(gaga)
## 1 1          1          0
## 2 2          1          1
## 3 3          1          2

If you want just the counts:

qdap:::termco.h(x, "gaga", 1:3)[, 3]

Use stri_count_fixed from the stringi package:

    library(stringi)
    dna=c("a","g","c","t")
    N=160
    x=sample(dna,N,replace=TRUE)
    x2 <- stri_paste(x,collapse="")
    stri_count_fixed(x2,"gaga")
    ## 2 (the sequence is random, so the count varies between runs)

Here's an approach that counts overlaps too:

vec <- c("gagatttt",
"ttttgaga",
"gaga",
"tttgagattt",
"gagagaga",
"gagaga")


lengths(strsplit(vec, "ga(?=ga)", perl = TRUE)) - 1L
# [1] 1 1 1 1 3 2
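
The split-based trick above depends on the shape of this particular word. A more general way to count overlapping matches is to slide a window across the string and compare each substring with the word. This is a minimal sketch, not taken from the original answers; `count_overlapping` is a hypothetical helper name.

```r
# Count occurrences of `word` in a single string `s`, overlaps included,
# by comparing every window of nchar(word) characters with `word`.
count_overlapping <- function(s, word) {
  k <- nchar(word)
  n <- nchar(s)
  if (n < k) return(0L)
  starts <- seq_len(n - k + 1L)
  sum(substring(s, starts, starts + k - 1L) == word)
}

vec <- c("gagatttt", "ttttgaga", "gaga", "tttgagattt", "gagagaga", "gagaga")
res <- vapply(vec, count_overlapping, integer(1), word = "gaga",
              USE.NAMES = FALSE)
res
# [1] 1 1 1 1 3 2
```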
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow