Question

I have a table like this:

>head(X)
column1    column2
sequence1 ATCGATCGATCG
sequence2 GCCATGCCATTG

I need an output in a fasta file, looking like this:

sequence1  
ATCGATCGATCG
sequence2  
GCCATGCCATTG

So, basically, I need each entry of the 2nd column to become a new row, interleaved with the entries of the first column. The old 2nd column can then be discarded.

The way I would normally do that is by replacing the whitespace (or tab) with \n in Notepad++, but I fear my files will be too big for that.

Is there a way to do that in R?


Solution 2

D <- do.call(rbind, lapply(seq_len(nrow(X)), function(i) t(X[i, ])))
D
#         1             
# column1 "sequence1"   
# column2 "ATCGATCGATCG"
# column1 "sequence2"   
# column2 "GCCATGCCATTG"

Then, when you write it out, you could use the following (add a file argument to write to disk instead of printing to the console)

write.table(D, row.names = FALSE, col.names = FALSE, quote = FALSE)
# sequence1
# ATCGATCGATCG
# sequence2
# GCCATGCCATTG

so that the row names, column names, and quotes will be gone.
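As a minimal sketch of the same approach writing to an actual file, the example below rebuilds the question's table as `X` and writes `D` to a temporary file. The sample data and the `out_file` path are assumptions for illustration only. Note that this answer does not add the ">" prefix that FASTA headers usually carry.

```r
# Sample data mirroring the table in the question (assumed for illustration)
X <- data.frame(
  column1 = c("sequence1", "sequence2"),
  column2 = c("ATCGATCGATCG", "GCCATGCCATTG"),
  stringsAsFactors = FALSE
)

# Transpose each row into a one-column matrix and stack the pieces
D <- do.call(rbind, lapply(seq_len(nrow(X)), function(i) t(X[i, ])))

# Write to a temporary file (hypothetical path) instead of the console
out_file <- tempfile(fileext = ".fasta")
write.table(D, file = out_file, row.names = FALSE, col.names = FALSE, quote = FALSE)

# Read the file back to check what was written
written <- readLines(out_file)
print(written)
```

Because this loops over rows one at a time, it may be slow for very large tables.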

OTHER TIPS

I had the same question but found a really easy way to convert a data frame to a fasta file using the package: "seqRFLP".

First, install and load seqRFLP:

install.packages("seqRFLP")
library("seqRFLP")

Your sequences need to be in a data frame with the sequence headers in column 1 and the sequences in column 2 (it doesn't matter whether they are nucleotide or amino acid sequences).

Here is a sample data frame:

names <- c("seq1", "seq2", "seq3", "seq4")
sequences <- c("EPTFYQNPQFSVTLDKR", "SLLEDPCYIGLR", "YEVLESVQNYDTGVAK", "VLGALDLGDNYR")
df <- data.frame(names, sequences)

Then convert the data frame to .fasta format using the function: 'dataframe2fas'

df.fasta = dataframe2fas(df, file="df.fasta")

When I do this, I tend to use something like:

Xfasta <- character(nrow(X) * 2)
Xfasta[c(TRUE, FALSE)] <- paste0(">", X$column1)
Xfasta[c(FALSE, TRUE)] <- X$column2

This creates an empty character vector with twice as many elements as your table has rows, then puts the values from column1 in the odd positions (1, 3, 5, ...) and the values from column2 in the even positions (2, 4, 6, ...). The logical index c(TRUE, FALSE) is recycled along the vector, which is what selects every second position.

Then write it out using writeLines:

writeLines(Xfasta, "filename.fasta")
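Put together, the steps above can be sketched as a small helper. The function name `df_to_fasta_lines`, the sample table `X`, and the temporary output path are all hypothetical and only serve the illustration.

```r
# Hypothetical helper: interleave headers and sequences into FASTA lines.
# Assumes headers are in column 1 and sequences in column 2.
df_to_fasta_lines <- function(df) {
  lines <- character(nrow(df) * 2)
  # c(TRUE, FALSE) is recycled, so it selects positions 1, 3, 5, ...
  lines[c(TRUE, FALSE)] <- paste0(">", as.character(df[[1]]))
  # c(FALSE, TRUE) selects positions 2, 4, 6, ...
  lines[c(FALSE, TRUE)] <- as.character(df[[2]])
  lines
}

# Sample data mirroring the question (assumed for illustration)
X <- data.frame(
  column1 = c("sequence1", "sequence2"),
  column2 = c("ATCGATCGATCG", "GCCATGCCATTG"),
  stringsAsFactors = FALSE
)

fasta_lines <- df_to_fasta_lines(X)

# Write to a temporary file and read it back to check the result
out_file <- tempfile(fileext = ".fasta")
writeLines(fasta_lines, out_file)
round_trip <- readLines(out_file)
print(round_trip)
```

Since everything is vectorised, this should scale to large tables better than a row-by-row loop.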

In this answer, I added a ">" to the headers since this is standard for fasta format and is required by some tools that take fasta input. If you don't care about adding the ">", then:

Xfasta <- character(nrow(X) * 2)
Xfasta[c(TRUE, FALSE)] <- X$column1
Xfasta[c(FALSE, TRUE)] <- X$column2

If you didn't read your file in with options that stop strings being read as factors (for example stringsAsFactors = FALSE, which is the default since R 4.0.0), then you might need to assign as.character(X$column1) and as.character(X$column2) instead. There are also a few tools available for this conversion; I think Galaxy has an option for it.
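To illustrate the factor caveat, here is a short sketch that deliberately builds the table with factor columns (an assumption made just for the demonstration) and converts them with `as.character()` before filling the vector. Without the conversion, the assignment may store the factors' integer codes rather than the sequences themselves.

```r
# Sample table built with factor columns on purpose (assumed for illustration)
X <- data.frame(
  column1 = c("sequence1", "sequence2"),
  column2 = c("ATCGATCGATCG", "GCCATGCCATTG"),
  stringsAsFactors = TRUE
)

# Confirm that the columns really are factors here
stopifnot(is.factor(X$column1), is.factor(X$column2))

# Convert explicitly so the text, not the factor codes, is stored
Xfasta <- character(nrow(X) * 2)
Xfasta[c(TRUE, FALSE)] <- paste0(">", as.character(X$column1))
Xfasta[c(FALSE, TRUE)] <- as.character(X$column2)
print(Xfasta)
```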

add ">" to headers

X$column1 <- paste0(">",X$column1)

bind rows of headers and seqs

seqs_fasta <- c(rbind(X$column1, X$column2))

write fasta

write(x = seqs_fasta, file = "/home/../my_seqs.fasta")
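The interleaving in this last approach works because `rbind()` builds a two-row matrix (headers on top, sequences below) and `c()` flattens a matrix column by column. A self-contained sketch, using an assumed sample table `X` and a temporary file in place of a real path:

```r
# Sample data mirroring the question (assumed for illustration)
X <- data.frame(
  column1 = c("sequence1", "sequence2"),
  column2 = c("ATCGATCGATCG", "GCCATGCCATTG"),
  stringsAsFactors = FALSE
)

# Keep the headers in a separate variable so X is not modified in place
headers <- paste0(">", X$column1)

# rbind() gives a 2 x n matrix: row 1 = headers, row 2 = sequences
m <- rbind(headers, X$column2)

# c() reads the matrix column by column: header 1, seq 1, header 2, seq 2, ...
seqs_fasta <- c(m)

# write() puts one element per line when given a character vector
out_file <- tempfile(fileext = ".fasta")
write(x = seqs_fasta, file = out_file)
written <- readLines(out_file)
print(written)
```

Using a separate `headers` variable also avoids adding a second ">" if the code is run twice on the same table.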
Licensed under: CC-BY-SA with attribution