Question

I have a dataframe with loci names in one column and DNA sequences in the other. I'm trying to use as.DNAbin{ape} or similar to create a DNAbin object.

Here is some example data:

x <- structure(c("55548", "43297", "35309", "34468", "AATTCAATGCTCGGGAAGCAAGGAAAGCTGGGGACCAACTTCTCTTGGAGACATGAGCTTAGTGCAGTTAGATCGGAAGAGCA", "AATTCCTAAAACACCAATCAAGTTGGTGTTGCTAATTTCAACACCAACTTGTTGATCTTCACGTTCACAACCGTCTTCACGTT", "AATTCACCACCACCACTAGCATACCATCCACCTCCATCACCACCACCGGTTAAGATCGGAAGAGCACACTCTGAACTCCAGTC", "AATTCTATTGGTCATCACAATGGTGGTCCGTGGCTCACGTGCGTTCCTTGTGCAGGTCAACAGGTCAAGTTAAGATCGGAAGA"), .Dim = c(4L, 2L))

If I try y <- as.DNAbin(x), R creates a sort of DNAbin object with 4 DNA sequences (the 4 rows of the example) of length 2 (the two columns, I assume). There are no labels, and of course the base composition doesn't work either.

The documentation is not very clear, but after playing with the woodmouse example data from the package, I think what I need to do is create a matrix with each base in its own column and then use as.DNAbin, i.e. in the above example a 4 x 84 matrix (1 column for the locus name and 83 for the sequence?). Any advice on how to do this, or any better ideas?

Thanks


Solution

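The first argument of as.DNAbin should be a matrix or a list containing the DNA sequences, or an object of class "alignment". So your idea is right, with one adjustment: the locus names go in the row names rather than in a column of their own, which gives a 4 x 83 matrix.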

Given that x is the structure from the original post, the code below prepares the matrix y:

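# Split each sequence into single characters and lower-case them;
# sapply() returns one column per sequence, so transpose to get one row per locus
y <- t(sapply(strsplit(x[,2],""), tolower))
# Use the locus names as row names so that as.DNAbin picks them up as labels
rownames(y) <- x[,1]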

Then as.DNAbin(y) shows:

4 DNA sequences in binary format stored in a matrix.

All sequences of same length: 83 

Labels: 55548 43297 35309 34468 

Base composition:
    a     c     g     t 
0.289 0.262 0.205 0.244 
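
For reference, here is a minimal self-contained sketch of the same approach, covering both the matrix and the list input mentioned above. It assumes the ape package is installed. The locus names and sequences in it are short made-up placeholders, not the data from the question, and the variable names (bases, m, dna_matrix, dna_list) are chosen only for illustration.

```r
library(ape)  # provides as.DNAbin() and base.freq()

# Same layout as x in the question: column 1 = locus name, column 2 = sequence.
# These values are made-up placeholders, not the original data.
x <- matrix(c("locA", "locB", "locC",
              "AATTC", "AATTG", "ACGTA"),
            ncol = 2)

# Split each sequence into single lower-case bases and name each element
# after its locus.
bases <- lapply(strsplit(x[, 2], ""), tolower)
names(bases) <- x[, 1]

# Matrix form: one row per locus, one column per site.
# This requires all sequences to have the same length.
m <- do.call(rbind, bases)
dna_matrix <- as.DNAbin(m)

# List form: one element per locus; lengths may differ.
dna_list <- as.DNAbin(bases)

dim(dna_matrix)        # number of loci, number of sites
labels(dna_matrix)     # locus names taken from the row names
base.freq(dna_matrix)  # base composition
```

The list form may be the more convenient choice if the loci do not all have the same length, since a matrix cannot hold rows of different lengths.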
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow