Question

I have set of EST sequences in fasta file. Here, how to subset based on sequence ID or name?

>gi|296783888|gb|GW992815.1|GW992815 UAS-Mi10 Complementary DNA of mulberry (Morus indica) Morus indica cDNA 5' similar to Putative phosphoribosyltransferase/phosphoribosylanthranilate-like gene from Morus indica, mRNA sequence
GCAGCCGTCGGATCGTGAGCGTGATCGCGTGGCTAGTCGGGTTGGCGAAATGGTTGGATGATATCCGGAG
GTGGAGGAACCCCATTACCACGGTATTGGTCCACATCTTATATTTAGTGCTTGTTTGGTACCCGGATTTG
ATTGTCCCAACCGGGTTTTTATATGTGTTCCTAATCGGTGTATGGTACTATCGGTTTCGGCCCAAGATAC
CAGCGGGTATGGATACCCGACTCTCACAAGCTGAAGCGGTTGACCCGGATGAGCTTGATGAGGAATTCGA
CACCATACCGAGCTCAAAACCACCCGACATAATCAGGGTCCGGTATGACCGGTTGCGGATATTGGCAGCC
CGGGTTCAAACGGTTTTGGGTGATTTTGCAACACAAGGGGAGCGGGTTCAGGCCTTGGTTAGCTGGAGGG
ACCCAAGGGCCACAAAATTGTTCATAGGCGTGTGCTTGGCCATAACAATAATTCTCTATGTGGTGCCACC
CAAAATGGTTGCCGTGGCACTTGGATTCTACTATTTACGACACCCCATGTTCCGAGACCCCATGCCTCCT
GCAAGCTTGAATTTCTTCAGAAGGCTTCCAAGCCTTTCAGACCGCTTTAATGTAGATTAGAATATTATAT
GATTATTAGTAGGCCCAA

>gi|296783887|gb|GW992814.1|GW992814 UAS-Mi9 Complementary DNA of mulberry (Morus indica) Morus indica cDNA 5' similar to Dehydration-responsive protein RD22, Similar to BURP domain-containing protein like gene from Morus indica, mRNA sequence
AAGCAGTGGTCTAGAACCAGAGTGGCCCCTGCGATGCAGGTATCATCTCTATTATCAAAAGGGATAAGGG
GTGGATCCGTCGGGGATTTGAGTCTCACATGGTCGCTGATAACTTATTGAATGGATATTGGATTGTGTGC
AGTGCGACCTAAACAGGATTGCCGTTGGGGCCTGTGGTCAGAGATACCCCACACTTCTCAACTCCCAAAT
TGGATCTTGTTCCTTGTTTTCCTGTATTAAGCCTGACCCCTGAGGCTTTCGCCACTGCCAACTGGGTGCC
GCCTGCTGACTTCTGATTCCCCGTGCTAACGGTTACTCCCGATTCCTTATCCACATCGAAGATGAACTAT
TGACTTCCGCAAACTCAAAAGGCTGCAAGATATCACTGACCGCTGTCGGGATCCGCGATCGGCATATACG
CGAAATCCGATCCCGGATCCCGGGACTGCAGACGGCTGAA

Like using header line >gi|296783888|gb|GW992815.1|GW992815 UAS-Mi10 Complementary DNA of mulberry (Morus indica) Morus indica cDNA 5' similar to Putative phosphoribosyltransferase/phosphoribosylanthranilate-like gene from Morus indica, mRNA sequence or using only >gi|296783888
How to do this in R?

Was it helpful?

Solution

For a slightly more heavy-weight solution, if this fits in to the Bioconductor work flow,

source("http://bioconductor.org/biocLite.R")
biocLite("Rsamtools")

to install the Biostrings and Rsamtools package, then

library(Rsamtools)
indexFa("foo.fasta")   # create an index of file 'foo.fasta'
fa = FaFile("foo.fasta")  # reference the fasta file and it's index

You can discover the coordinates (names and start / end) of each sequence with

gr = as(seqinfo(fa), "GRanges")

and query for arbitrary sequences and ranges within sequences by choosing appropriate subsets, e.g., the second sequence and then first sequence in your example

getSeq(fa, gr[2:1])

or by looking up the coordinates by partial match to the names

idx = pmatch("gi|296783888", names(gr))  ## NA's if duplicates or not unique
seq = getSeq(fa, gr[idx])

"seq" is a DNAStringSet, and can be manipulated in many ways; see the vignettes available in the package

vignette(package="Biostrings")

especially the Quick Overview. To save the object to a fasta file 'file.fa' in a directory 'some' relative to the current working directory, use

writeXStringSet(seq, "some/file.fa")
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top