Question

What algorithm should I use to sort FASTA sequences by length (shortest first)? The output needs to show each complete FASTA entry in length order, not just the lengths.

I can get each sequence's length with Bio::FastaFormat#length, collect the lengths in an array, and sort that:

require 'rubygems'
require 'bio'

file = Bio::FastaFormat.open(ARGV.shift)
seqarray = []
file.each do |seq|
  a = seq.length
  seqarray.push a
end

puts seqarray.sort

This displays the sequence lengths in order, but what I need to be able to see is the original FASTA format, in length order.

I can't add seq.length (the length of each sequence) to seq.entry (the entire FASTA entry) and then sort, because seq.length is an integer and seq.entry is a string. I tried converting the length with seq.length.to_s, prepending it to seq.entry, and then sorting. This is the closest I've got. Unfortunately, the lengths are now strings, so they sort character by character (1, 11, 111, 2) instead of numerically (1, 2, 11, 111):

require 'rubygems'
require 'bio'

file = Bio::FastaFormat.open(ARGV.shift)
seqarray = []
file.each do |seq|
  a = (seq.length).to_s + ' = length' + seq.entry
  seqarray.push a
end
puts seqarray.sort
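
The ordering problem described above can be reproduced without BioRuby. This is a minimal sketch; the helper names `sort_as_strings` and `sort_as_integers` and the sample lengths are made up for illustration and are not part of the original code.

```ruby
# Lengths stored as strings are compared character by character,
# so "11" sorts before "2".
def sort_as_strings(lengths)
  lengths.map(&:to_s).sort
end

# Lengths kept as integers are compared numerically.
def sort_as_integers(lengths)
  lengths.sort
end

lengths = [111, 2, 11, 1, 3] # hypothetical sequence lengths
p sort_as_strings(lengths)   # => ["1", "11", "111", "2", "3"]
p sort_as_integers(lengths)  # => [1, 2, 3, 11, 111]
```

Prepending the length to the entry text turns every sort key into a string, which is why the second attempt orders the entries the first way rather than the second.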

After that I tried the same approach with the sequence_id instead of the entire entry, leaving the length as an integer. But the id is a string with letters in it, so I can't add it to the integer length without getting an error message.
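
One way around the integer-plus-string error is to keep the length and the id side by side in a small array instead of adding them together. This is a sketch under that assumption; `sort_pairs` and the sample lengths and ids are hypothetical.

```ruby
# Each pair holds an integer length and a string id.
# Sorting uses only the integer, so nothing has to be converted.
def sort_pairs(pairs)
  pairs.sort_by { |length, _id| length }
end

pairs = [[120, "seqB"], [9, "seqC"], [45, "seqA"]] # hypothetical lengths and ids
sort_pairs(pairs).each { |length, id| puts "#{id}\t#{length}" }
```

The same idea works with the whole entry in place of the id, since the second element is only carried along and never compared.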

So yeah, any suggestions?

Solution

I think you can use the approach from "how to sort a ruby array of strings by length".

Collect the entries themselves in the array, not their lengths, then sort the array with sort_by and a block that returns each entry's length.

Like this:

require 'rubygems'
require 'bio'

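file = Bio::FastaFormat.open(ARGV.shift)

# Keep the whole entry objects, not just their lengths
seqarray = []
file.each do |seq|
  seqarray.push seq
end

# Sort the entries by sequence length, shortest first, and print them
puts seqarray.sort_by { |x| x.length }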
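
To show what sort_by is doing here without needing the bio gem, the sketch below parses FASTA text by hand and sorts whole records by sequence length. The helpers `parse_fasta`, `sort_by_length`, and `to_fasta`, along with the sample data, are hypothetical and stand in for what Bio::FastaFormat provides. The sketch also writes each sequence back on a single line, so the original line wrapping is not preserved.

```ruby
# Split FASTA text into records; each record keeps its header and its sequence.
def parse_fasta(text)
  text.split(/^>/).reject(&:empty?).map do |chunk|
    header, *lines = chunk.lines.map(&:chomp)
    { header: header, sequence: lines.join }
  end
end

# Sort whole records by sequence length, shortest first.
def sort_by_length(records)
  records.sort_by { |rec| rec[:sequence].length }
end

# Render a record back to FASTA text (sequence on one line).
def to_fasta(rec)
  ">#{rec[:header]}\n#{rec[:sequence]}\n"
end

# Hypothetical sample input
sample = <<~FASTA
  >seqA first
  ACGTACGTACGT
  >seqB second
  ACG
  >seqC third
  ACGTAC
  GT
FASTA

puts sort_by_length(parse_fasta(sample)).map { |rec| to_fasta(rec) }
```

Because the sort key is an integer computed from each record, the records come out in numeric length order while keeping their full header and sequence.
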
Licensed under: CC-BY-SA with attribution