Question

Using BioHaskell, how can I read a FASTA file containing aminoacid sequences?

I want to be able to:

  • Get a list of String sequences
  • Get a Map String String (from Data.Map ) from the FASTA comment (assumed to be unique) to the sequence String
  • Use the sequences in algorithms implemented in BioHaskell.

Note: This question intentionally does not show research effort as it was immediately answered in a Q&A-style manner.

Was it helpful?

Solution

Extracting raw sequence strings

We will assume from now on that the file aa.fa contains some aminoacid FASTA sequences. Let's start with a simple example that extracts a list of sequences.

import Bio.Sequence.Fasta (readFasta)
import Bio.Sequence.SeqData (seqdata)
import qualified Data.ByteString.Lazy.Char8 as LB

main = do
    sequences <- readFasta "aa.fa"
    let listOfSequences = map (LB.unpack . seqdata) sequences :: [String]
    -- Just for show, we will print one sequence per line here
    -- This will basically execute putStrLn for each sequence
    mapM_ putStrLn listOfSequences

readFasta returns IO [Sequence Unknown]. Basically that means there is no information about whether the sequences contain Aminoacids or nucleotides.

Note that we use LB.unpack instead of show here, because show adds double quotes (") at the beginning and the end of the resulting String. Using LB.unpack works, because in the current BioHaskell version 0.5.3., SeqData is just defined as lazy ByteString.

We can fix this by using castToAmino or castToNuc:

Converting to AA/Nucleotide sequences

let aaSequences = map castToAmino sequences :: [Sequence Amino]

Note that those function currently (BioHaskell version 0.5.3) do not perform any validity checks. You can use the [Sequence Amino] or [Sequence Nuc] in the BioHaskell algorithms.

Lookup sequence by FASTA header

We will now assume that our aa.fa contains a sequence

>abc123
MGLIFARATNA...

Now, we will build a Map String String (we will use Data.Map.Strict in this example) from the FASTA file. We can use this map to lookup the sequence.

The lookup will yield a Maybe String. The intended behaviour in this example is to print the sequence if it was found, or not to print anything if nothing was found in the Map.

As Data.Maybe is a Monad, we can use Data.Foldable.mapM_ for this task.

import Bio.Sequence.Fasta (readFasta)
import Bio.Sequence.SeqData (Sequence, seqdata, seqheader)
import qualified Data.ByteString.Lazy.Char8 as LB
import Data.Foldable (mapM_)
import qualified Data.Map.Strict as Map

-- | Convert a Sequence to a String tuple (sequence label, sequence)
sequenceToMapTuple :: Sequence a -> (String, String)
sequenceToMapTuple s = (LB.unpack $ seqheader s, LB.unpack $ seqdata s)

main = do
    sequences <- readFasta "aa.fa"
    -- Build the sequence map (by header)
    let sequenceMap = Map.fromList $ map sequenceToMapTuple sequences
    -- Lookup the sequence for the "abc123" header
    mapM_ print $ Map.lookup "abc123" sequenceMap

Edit: Thanks to @GabrielGonzalez suggestion, the final example now uses Data.Foldable.mapM_ instead of Data.Maybe.fromJust

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top