Question

I have a script that performs BLAST queries (bl2seq)

The script works like this:

  1. Get sequence a, sequence b
  2. write sequence a to filea
  3. write sequence b to fileb
  4. run command 'bl2seq -i filea -j fileb -n blastn'
  5. get output from STDOUT, parse
  6. repeat 20 million times

The program bl2seq does not support piping. Is there any way to do this and avoid writing/reading to the harddrive?

I'm using Python BTW.

Was it helpful?

Solution

How do you know bl2seq does not support piping.? By the way, pipes is an OS feature, not the program. If your bl2seq program outputs something, whether to STDOUT or to a file, you should be able to parse the output. Check the help file of bl2seq for options to output to file as well, eg -o option. Then you can parse the file.

Also, since you are using Python, an alternative you can use is BioPython module.

OTHER TIPS

Depending on what OS you're running on, you may be able to use something like bash's process substitution. I'm not sure how you'd set that up in Python, but you're basically using a named pipe (or named file descriptor). That won't work if bl2seq tries to seek within the files, but it should work if it just reads them sequentially.

Is this the bl2seq program from BioPerl? If so, it doesn't look like you can do piping to it. You can, however, code your own hack using Bio::Tools::Run::AnalysisFactory::Pise, which is the recommended way of going about it. You'd have to do it in Perl, though.

If this is a different bl2seq, then disregard the message. In any case, you should probably provide some more detail.

Wow. I have it figured out.

The answer is to use python's subprocess module and pipes!

EDIT: forgot to mention that I'm using blast2 which does support piping.

(this is part of a class)

def _query(self):
    from subprocess import Popen, PIPE, STDOUT
    pipe = Popen([BLAST,
    '-p', 'blastn',
    '-d', self.database,
    '-m', '8'],
    stdin=PIPE,
    stdout=PIPE)
    pipe.stdin.write('%s\n' % self.sequence)
    print pipe.communicate()[0]

where self.database is a string containing the database filename, ie 'nt.fa' self.sequence is a string containing the query sequence

This prints the output to the screen but you can easily just parse it. No slow disk I/O. No slow XML parsing. I'm going to write a module for this and put it on github.

Also, I haven't gotten this far yet but I think you can do multiple queries so that the blast database does not need to be read and loaded into RAM for each query.

I call blast2 using R script:

....
system("mkfifo seq1")
system("mkfifo seq2")
system("echo  sequence1 > seq1"), wait = FALSE)
system("echo  sequence2 > seq2"), wait = FALSE)
system("blast2 -p blastp -i seq1 -j seq2 -m 8", intern = TRUE)
....

This is 2 times slower(!) vs. writing and reading from hard drive!

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top