Downloading Protein Sequences of multiple Organisms

Question 1

Try downloading the sequences from PATRIC's FTP site, which is a gold mine: it is much better organized than NCBI, and the data are a lot cleaner. PATRIC is backed by the NIH, by the way.

PATRIC contains 15,000+ genomes and provides, for each one, the DNA, the proteins, the DNA of the protein-coding regions, EC numbers, pathways, and GenBank records in separate files. Super convenient. Have a look for yourself:

ftp://ftp.patricbrc.org/patric2

I suggest you first download the desired files for all organisms and then pick out the ones you need once they are all on your hard drive. The following Python script downloads the EC number annotation files provided by PATRIC in one go. It takes the output directory as its only argument; if you are behind a proxy, configure it in the commented-out section:

from ftplib import FTP
import sys, os

# If you are behind a proxy, fill in your proxy IP here
# and use these lines instead of the direct connection below:
#site = FTP('1.1.1.1')
#site.set_debuglevel(1)
#msg = site.login('anonymous@ftp.patricbrc.org')

site = FTP("ftp.patricbrc.org")
site.login()
site.cwd('/patric2/current_release/ec/')

bacteria_list = []
site.retrlines('LIST', bacteria_list.append)

# The output directory is the first command-line argument
output = sys.argv[1]
if not output.endswith("/"):
    output += "/"

print("bacteria_list: %d" % len(bacteria_list))


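for c in bacteria_list:
    # The last whitespace-separated field of a LIST line is the file name
    path_name = c.strip().split()[-1]

    if "PATRIC.ec" in path_name:
        filename = path_name.split("/")[-1]
        # Write in binary mode: retrbinary hands raw bytes to the callback
        with open(output + filename, 'wb') as handle:
            site.retrbinary('RETR ' + path_name, handle.write)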
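
Once everything is on disk, picking out the organisms you need can be as simple as matching on the file names. Here is a minimal sketch of that step. It assumes the downloaded files are named after the organism with underscores in place of spaces (something like Escherichia_coli_K12.PATRIC.ec, which is a made-up example), so check a few real file names before relying on it. The function name pick_files and its parameters are mine, not anything provided by PATRIC.

```python
import os


def pick_files(directory, organisms, suffix=".PATRIC.ec"):
    """Return paths of downloaded files whose names mention a wanted organism.

    Assumption: file names contain the organism name with underscores
    instead of spaces, e.g. the hypothetical 'Escherichia_coli_K12.PATRIC.ec'.
    """
    # Normalise the wanted names the same way the file names are assumed to be written
    wanted = [name.lower().replace(" ", "_") for name in organisms]

    picked = []
    for filename in sorted(os.listdir(directory)):
        # Skip anything that is not an annotation file
        if not filename.endswith(suffix):
            continue
        lowered = filename.lower()
        if any(name in lowered for name in wanted):
            picked.append(os.path.join(directory, filename))
    return picked
```

You would call it with the same directory you passed to the download script, for example pick_files("ec_files/", ["Escherichia coli"]), and then process only the returned paths.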

Question 2

While I have no experience with Python, let alone Biopython, a quick Google search turned up a couple of things for you to look at:

urllib2 HTTP Error 400: Bad Request

urllib2 gives HTTP Error 400: Bad Request for certain urls, works for others
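
One common cause of a 400 that shows up for some URLs but not others is a query containing unescaped characters such as spaces (an organism name, for instance). I can't tell whether that is what is happening in your case, but if it is, percent-encoding the query parameters before building the URL is worth a try. A minimal sketch, where build_url, the example.org address, and the parameter names are all placeholders of mine:

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2


def build_url(base, params):
    """Return base plus a percent-encoded query string.

    params is a list of (key, value) pairs; urlencode escapes spaces
    and other characters that are not allowed in a raw URL.
    """
    return base + "?" + urlencode(params)


# Placeholder address and parameters, for illustration only
url = build_url("http://example.org/search",
                [("term", "Escherichia coli"), ("db", "protein")])
print(url)
```

The resulting string contains no raw spaces and can be passed to urlopen as usual.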