Extracting raw sequence strings
We will assume from now on that the file aa.fa
contains some aminoacid FASTA sequences. Let's start with a simple example that extracts a list of sequences.
import Bio.Sequence.Fasta (readFasta)
import Bio.Sequence.SeqData (seqdata)
import qualified Data.ByteString.Lazy.Char8 as LB
main = do
sequences <- readFasta "aa.fa"
let listOfSequences = map (LB.unpack . seqdata) sequences :: [String]
-- Just for show, we will print one sequence per line here
-- This will basically execute putStrLn for each sequence
mapM_ putStrLn listOfSequences
readFasta
returns IO [Sequence Unknown]
. Basically that means there is no information about whether the sequences contain Aminoacids or nucleotides.
Note that we use LB.unpack
instead of show
here, because show
adds double quotes ("
) at the beginning and the end of the resulting String
. Using LB.unpack
works, because in the current BioHaskell version 0.5.3., SeqData
is just defined as lazy ByteString
.
We can fix this by using castToAmino
or castToNuc
:
Converting to AA/Nucleotide sequences
let aaSequences = map castToAmino sequences :: [Sequence Amino]
Note that those function currently (BioHaskell version 0.5.3) do not perform any validity checks. You can use the [Sequence Amino]
or [Sequence Nuc]
in the BioHaskell algorithms.
Lookup sequence by FASTA header
We will now assume that our aa.fa
contains a sequence
>abc123
MGLIFARATNA...
Now, we will build a Map String String
(we will use Data.Map.Strict
in this example) from the FASTA file. We can use this map to lookup the sequence.
The lookup will yield a Maybe String
. The intended behaviour in this example is to print the sequence if it was found, or not to print anything if nothing was found in the Map
.
As Data.Maybe
is a Monad, we can use Data.Foldable.mapM_
for this task.
import Bio.Sequence.Fasta (readFasta)
import Bio.Sequence.SeqData (Sequence, seqdata, seqheader)
import qualified Data.ByteString.Lazy.Char8 as LB
import Data.Foldable (mapM_)
import qualified Data.Map.Strict as Map
-- | Convert a Sequence to a String tuple (sequence label, sequence)
sequenceToMapTuple :: Sequence a -> (String, String)
sequenceToMapTuple s = (LB.unpack $ seqheader s, LB.unpack $ seqdata s)
main = do
sequences <- readFasta "aa.fa"
-- Build the sequence map (by header)
let sequenceMap = Map.fromList $ map sequenceToMapTuple sequences
-- Lookup the sequence for the "abc123" header
mapM_ print $ Map.lookup "abc123" sequenceMap
Edit: Thanks to @GabrielGonzalez suggestion, the final example now uses Data.Foldable.mapM_
instead of Data.Maybe.fromJust