The 'subject' argument for matchPattern is a special object (e.g. an XString). You can convert your sequences to XStrings by collapsing each one with paste and passing the result to BString (see ?BString).
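To see what the collapsing step does, here is a minimal base-R sketch. The vector x is a made-up stand-in for one element of the list returned by read.fasta, assuming it stores one character per base (which is what the paste(x, collapse = "") call in this answer implies).

```r
# Hypothetical stand-in for one sequence: one element per base
x <- c("a", "t", "g", "a", "c", "c")

# Collapse the six one-letter strings into a single string
collapsed <- paste(x, collapse = "")

length(x)          # number of separate elements before collapsing
length(collapsed)  # a single string afterwards
collapsed
```

The single collapsed string is what gets passed to BString in the code that follows.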
So, with your data:
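library(Biostrings)  # provides BString and matchPattern
library(seqinr)      # assumed source of read.fasta
file <- read.fasta(file = "mydata.txt")
# find 'atg' locations
atg <- lapply(file, function(x) {
  # collapse the per-base character vector into one string
  string <- BString(paste(x, collapse = ""))
  matchPattern("atg", string)
})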
atg[1:2]
# $a
# Views on a 18-letter BString subject
# subject: atgacccccaccgagtaa
# views:
# start end width
# [1] 1 3 3 [atg]
#
# $b
# Views on a 21-letter BString subject
# subject: atgcccactgtcatcacctaa
# views:
# start end width
# [1] 1 3 3 [atg]
For a simple example, finding the number and locations of 'atg's in a sequence:
sequence <- BString("atgatgccatgcccccatgcatgatatg")
result <- matchPattern("atg", sequence)
# Views on a 28-letter BString subject
# subject: atgatgccatgcccccatgcatgatatg
# views:
# start end width
# [1] 1 3 3 [atg]
# [2] 4 6 3 [atg]
# [3] 9 11 3 [atg]
# [4] 17 19 3 [atg]
# [5] 21 23 3 [atg]
# [6] 26 28 3 [atg]
# Find out how many 'atg's were found
length(result)
# [1] 6
# Get the start site of each 'atg'
start(result)
# [1] 1 4 9 17 21 26
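As a cross-check that does not depend on Biostrings, the same start positions can be recovered with base R's gregexpr. This is only a sketch of an alternative, not what matchPattern does internally. gregexpr reports non-overlapping matches only, which makes no difference here because 'atg' cannot overlap itself.

```r
# Base R only; same sequence as above, held as a plain character string
sequence <- "atgatgccatgcccccatgcatgatatg"

# fixed = TRUE treats "atg" as a literal string, not a regular expression
hits <- gregexpr("atg", sequence, fixed = TRUE)[[1]]

# as.integer() drops the match.length and other attributes
starts <- as.integer(hits)
starts
# [1]  1  4  9 17 21 26
length(starts)
# [1] 6
```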
Also, check out ?DNAString and ?RNAString. They are similar to BString, only they are limited to nucleotide characters, and allow for quick comparisons between DNA and RNA sequences.
Edit to address frame shifting concern mentioned in the comments: You can subset the result to get those 'atg's that are in frame using the modulo trick mentioned by @DWin.
# assuming the first 'atg' sets the frame
in.frame.result <- result[(start(result) - start(result)[1]) %% 3 == 0]
# Views on a 28-letter BString subject
# subject: atgatgccatgcccccatgcatgatatg
# views:
# start end width
# [1] 1 3 3 [atg]
# [2] 4 6 3 [atg]
# There are two 'atg's in frame in this result
length(in.frame.result)
# [1] 2
# With your data:
file = read.fasta(file = "mydata.txt")
atg <- lapply(file, function(x) {
string <- BString(paste(x, collapse = ""))
result <- matchPattern("atg", string)
result[(start(result) - start(result)[1]) %% 3 == 0]
})
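The modulo filter can be traced by hand on the start positions from the simple example. The sketch below uses base R only, with the six start positions typed in directly rather than taken from a matchPattern result.

```r
# Start positions of the six 'atg' matches in the simple example
starts <- c(1L, 4L, 9L, 17L, 21L, 26L)

# Distance of each match from the first 'atg', which sets the frame
offset <- starts - starts[1]
offset
# [1]  0  3  8 16 20 25

# A match is in frame when its distance is a multiple of 3
offset %% 3
# [1] 0 0 2 1 2 1

in.frame.starts <- starts[offset %% 3 == 0]
in.frame.starts
# [1] 1 4
```

Only the matches at positions 1 and 4 sit a whole number of codons away from the first 'atg', which is why the in-frame result above has length 2.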