Question

I have a dataframe (trip) that contains a column (SNP). It looks like this (but longer, and it has 192 levels):

SNP
C[T->C]T
C[G->C]A
G[A->C]A
C[T->C]C
C[C->A]G
T[G->A]C
...

I want to pattern match and replace on the following criteria:

gsub("G->T", "C->A", trip$SNP)
gsub("G->C", "C->G", trip$SNP)
gsub("G->A", "C->T", trip$SNP)
gsub("A->T", "T->A", trip$SNP)
gsub("A->G", "T->C", trip$SNP)
gsub("A->C", "T->G", trip$SNP)

but ALSO, if one of the patterns listed above is found, I want the string that contains it to have additional substitutions applied. Namely:

if ((grep(G->T|G->C|G->A|A->T|A->G|A->C), trip$SNP)==TRUE){
   substr(trip$SNP, 1,1) <- tr /ATCG/TAGC/; #incompatible perl syntax?
   substr(trip$SNP, 8,8) <- tr /ATCG/TAGC/;
   }

In other words, if any of these patterns (G->T, G->C, G->A, A->T, A->G, or A->C) is found in a string in trip$SNP, replace the 1st and 8th characters of that string according to this Perl transliteration: tr /ATCG/TAGC/;
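For reference, the pseudocode above could be written in plain R roughly as follows. This is only a sketch: it assumes every string has the fixed 8-character layout shown above (flanking base, bracketed substitution, flanking base), and the names hit, flank, and map are introduced here for illustration. The matching strings are detected first, before anything is rewritten, so that the bracket replacements do not hide them.

```r
SNP <- c("C[T->C]T", "C[G->C]A", "G[A->C]A", "C[T->C]C", "C[C->A]G", "T[G->A]C")

# Strings containing any of the six patterns (reference base G or A)
hit <- grepl("G->T|G->C|G->A|A->T|A->G|A->C", SNP)

# Complement the flanking bases (characters 1 and 8) of the matching strings only;
# chartr() plays the role of Perl's tr/ATCG/TAGC/
flank <- SNP[hit]
substr(flank, 1, 1) <- chartr("ATCG", "TAGC", substr(flank, 1, 1))
substr(flank, 8, 8) <- chartr("ATCG", "TAGC", substr(flank, 8, 8))
SNP[hit] <- flank

# Then apply the six bracket replacements as literal (fixed) strings
map <- c("G->T" = "C->A", "G->C" = "C->G", "G->A" = "C->T",
         "A->T" = "T->A", "A->G" = "T->C", "A->C" = "T->G")
for (from in names(map)) {
  SNP <- gsub(from, map[[from]], SNP, fixed = TRUE)
}

SNP
## [1] "C[T->C]T" "G[C->G]T" "C[T->G]T" "C[T->C]C" "C[C->A]G" "A[C->T]G"
```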

Desired output (the 2nd, 3rd, and 6th strings change):

SNP
C[T->C]T
C[G->C]A
G[A->C]A
C[T->C]C
C[C->A]G
T[G->A]C

to:

SNP
C[T->C]T
G[C->G]T
C[T->G]T
C[T->C]C
C[C->A]G
A[C->T]G

Is there a more elegant way to do this?


Solution 2

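SNP <- as.character(trip$SNP)  # work on character strings rather than factor levels
SNP
## [1] "C[T->C]T" "C[G->C]A" "G[A->C]A" "C[T->C]C" "C[C->A]G" "T[G->A]C"

# Find the strings whose reference base (left of "->") is A or G
i <- grep("(A|G)->", SNP)

# Complement every base in those strings; chartr() leaves "[", "-", ">" and "]" alone
SNP[i] <- chartr("ACGT", "TGCA", SNP[i])
SNP
## [1] "C[T->C]T" "G[C->G]T" "C[T->G]T" "C[T->C]C" "C[C->A]G" "A[C->T]G"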
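The reason a single chartr() call is enough: each of the six replacements in the question swaps both bases for their complements (G->T becomes C->A, and so on), and the 1st and 8th characters are complemented as well, so the whole string can simply be complemented in one pass. As a rough sketch, the same steps could be wrapped in a small function and written back to the data frame. The function name complement_purine_snps is made up for this example, and the data frame is rebuilt here from the six sample strings only so that the snippet runs on its own.

```r
# Hypothetical helper (name not from the original answer): complement every
# base in the strings whose reference base is A or G, leave the rest as-is
complement_purine_snps <- function(snp) {
  snp <- as.character(snp)                  # also accepts a factor column
  i <- grep("(A|G)->", snp)                 # indices of strings to complement
  snp[i] <- chartr("ACGT", "TGCA", snp[i])  # A<->T, C<->G across the whole string
  snp
}

# Sample data only; the real trip data frame has many more rows
trip <- data.frame(
  SNP = c("C[T->C]T", "C[G->C]A", "G[A->C]A", "C[T->C]C", "C[C->A]G", "T[G->A]C"),
  stringsAsFactors = TRUE
)

trip$SNP <- complement_purine_snps(trip$SNP)
trip$SNP
## [1] "C[T->C]T" "G[C->G]T" "C[T->G]T" "C[T->C]C" "C[C->A]G" "A[C->T]G"
```

Note that the column comes back as character; wrap the result in factor() if a factor is needed afterwards.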

OTHER TIPS

There are probably add-on packages that do this better and faster, but this works. It isn't exactly what you want (the flanking positions are filled with the literal text /ATCG/TAGC/ rather than complemented bases), but it's close enough that you can adapt it. Note that the first 14 lines only set up your data and the replacement table; the solution itself is just a few lines.

dat <- read.table(text="trip
C[T->C]T
C[G->C]A
G[A->C]A
C[T->C]C
C[C->A]G
T[G->A]C", header=TRUE, stringsAsFactors = FALSE)

replace <- matrix(c("G->T", "%s[C->A]%s",
"G->C", "%s[C->G]%s",
"G->A", "%s[C->T]%s",
"A->T", "%s[T->A]%s",
"A->G", "%s[T->C]%s",
"A->C", "%s[T->G]%s"), ncol=2, byrow=TRUE)


for(i in 1:nrow(replace)) {
    dat$trip[grepl(replace[i, 1], dat$trip)] <- replace[i, 2]
}

sprintf(dat$trip, "/ATCG/TAGC/", "/ATCG/TAGC/")

## [1] "C[T->C]T"                     "/ATCG/TAGC/[C->G]/ATCG/TAGC/"
## [3] "/ATCG/TAGC/[T->G]/ATCG/TAGC/" "C[T->C]C"                    
## [5] "C[C->A]G"                     "/ATCG/TAGC/[C->T]/ATCG/TAGC/"
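
One possible way to adapt this so the flanking bases are actually complemented, instead of being filled with the literal /ATCG/TAGC/ text, is to pass the complemented 1st and 8th characters to sprintf() inside the loop. This is a sketch rather than the original answer's code: it assumes the fixed 8-character layout of the sample strings, and out and hit are names introduced here.

```r
dat <- data.frame(
  trip = c("C[T->C]T", "C[G->C]A", "G[A->C]A", "C[T->C]C", "C[C->A]G", "T[G->A]C"),
  stringsAsFactors = FALSE
)

# Same lookup table as above: pattern to find, sprintf() template to fill
replace <- matrix(c("G->T", "%s[C->A]%s",
                    "G->C", "%s[C->G]%s",
                    "G->A", "%s[C->T]%s",
                    "A->T", "%s[T->A]%s",
                    "A->G", "%s[T->C]%s",
                    "A->C", "%s[T->G]%s"), ncol = 2, byrow = TRUE)

out <- dat$trip  # results go here so matching always uses the original strings
for (i in seq_len(nrow(replace))) {
  hit <- grepl(replace[i, 1], dat$trip, fixed = TRUE)
  # Fill the template with the complemented flanking bases (characters 1 and 8)
  out[hit] <- sprintf(replace[i, 2],
                      chartr("ATCG", "TAGC", substr(dat$trip[hit], 1, 1)),
                      chartr("ATCG", "TAGC", substr(dat$trip[hit], 8, 8)))
}

out
## [1] "C[T->C]T" "G[C->G]T" "C[T->G]T" "C[T->C]C" "C[C->A]G" "A[C->T]G"
```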
Licensed under: CC-BY-SA with attribution