How to query uniprot.org to get all UniProt IDs for a given species?

Question 1

You can do this using the REST API provided at uniprot.org; see the FAQ on retrieving entries via queries.

Most of the time you want to use NCBI/UniProt taxonomy identifiers instead of species names, e.g. 10090 instead of "Mus musculus". Using IDs instead of strings is more likely to get you the right thing.

Species concepts are getting a bit funny these days with more and more sequencing projects, so do pay attention to what you are getting and why.
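As a rough sketch of such a query by taxonomy ID, the snippet below only builds the request URL and offers an optional download helper. The endpoint (`https://rest.uniprot.org/uniprotkb/stream`) and the query fields `organism_id` and `reviewed` are my assumptions about the current REST service and are not taken from the answer above, so check them against the UniProt help pages before relying on them. The function names are made up for illustration.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint of the UniProt REST service; not given in the answer above.
BASE_URL = "https://rest.uniprot.org/uniprotkb/stream"


def build_query_url(taxon_id, reviewed_only=False):
    """Return a URL requesting one accession per line for a taxonomy ID."""
    query = f"organism_id:{int(taxon_id)}"  # assumed query field name
    if reviewed_only:
        query += " AND reviewed:true"  # assumed filter for Swiss-Prot entries
    return BASE_URL + "?" + urlencode({"query": query, "format": "list"})


def fetch_accessions(taxon_id, reviewed_only=False):
    """Download the accession list (needs network access; can be large)."""
    with urlopen(build_query_url(taxon_id, reviewed_only)) as response:
        text = response.read().decode("utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]
```

Calling `build_query_url(10090, reviewed_only=True)` gives a URL you can also paste into a browser or pass to curl; `fetch_accessions(10090)` would download everything for mouse, which may take a while.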

Question 2

I haven't looked at the idmapping file that you are talking about, but I've used the following file to get IDs for a given species: ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/docs/speindex.txt

Then I parse it like so:

#!/usr/bin/env perl
use strict;
use warnings;

my $spec = shift;
my $re = quotemeta $spec;

my @ids =();
while (<>) {
  if (/$re/../^$/) {
    chomp;
    next if ($_ eq $spec);  # skip species line
    s/^\s+//;               # remove leading whitespace
    push @ids, split(/, ?/, $_);
  }
}

print $_."\n" foreach @ids;

Run it from the command line, for example for 'Mus musculus (Mouse)':

perl script.pl "Mus musculus (Mouse)" speindex.txt
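If you would rather stay in Python, here is a rough port of the same idea. It assumes the layout the Perl script relies on: a species line, then indented comma-separated IDs, then a blank line. Unlike the Perl version it matches the species line exactly rather than by regex. The function name and the sample lines are made up for illustration and are not real file content.

```python
def ids_for_species(lines, species):
    """Collect the comma-separated IDs listed under an exact species line."""
    ids = []
    in_block = False
    for raw in lines:
        line = raw.strip()
        if not in_block:
            # Start collecting on the line after the species header.
            in_block = (line == species)
            continue
        if not line:
            # A blank line ends the block for this species.
            break
        ids.extend(part.strip() for part in line.split(",") if part.strip())
    return ids


# Made-up sample in the assumed layout (not real speindex.txt content).
sample = [
    "Homo sapiens (Human)\n",
    "  AAAA_HUMAN, BBBB_HUMAN\n",
    "\n",
    "Mus musculus (Mouse)\n",
    "  AAAA_MOUSE, BBBB_MOUSE,\n",
    "  CCCC_MOUSE\n",
    "\n",
]
mouse_ids = ids_for_species(sample, "Mus musculus (Mouse)")

# With the real file you would pass the open file handle instead:
# with open("speindex.txt") as handle:
#     mouse_ids = ids_for_species(handle, "Mus musculus (Mouse)")
```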

I hope this helps...Paul

Question 3

If you do not want to use a flat file, you can use the BioServices Python package, which retrieves the information from the UniProt website:

from bioservices import UniProt
u = UniProt()
results = u.search("organism:10090+and+reviewed:yes", columns="id,entry name", limit=2)
print(results)

The results variable is a string that you need to parse. It contains the UniProt entries and UniProt entry names. The previous command retrieves only 2 entries, but if you remove the argument limit=2, you will get all of them.

For instance, to get all entry names, you would type:

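# no limit argument, so all matching entries are returned
results = u.search("organism:10090+and+reviewed:yes", columns="id,entry name")
# skip the header line, then keep the second column (the entry name)
entries = [x.split()[1] for x in results.strip().split("\n")[1:]]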

This takes a few seconds to download the 17000 entries. If you remove "reviewed:yes", it takes about 30 seconds to a minute.
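If you want both columns rather than just the entry names, something like the sketch below might do. It assumes the returned string is tab-separated with one header row, which is what the snippet above implies but which I have not checked for every BioServices version. The function name and the sample rows are made-up placeholders, not real UniProt output.

```python
import csv
import io


def parse_uniprot_table(text):
    """Map accession -> entry name from a two-column, tab-separated table."""
    reader = csv.reader(io.StringIO(text.strip()), delimiter="\t")
    next(reader, None)  # skip the header row
    return {row[0]: row[1] for row in reader if len(row) >= 2}


# Made-up placeholder rows in the assumed format (not fetched from UniProt).
sample = "Entry\tEntry name\nX00001\tDEMO1_MOUSE\nX00002\tDEMO2_MOUSE\n"
mapping = parse_uniprot_table(sample)
```

With the real data you would call `parse_uniprot_table(results)` on the string returned by `u.search(...)`.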

I hope this is helpful.

For installation with Python 2.7, just type:

pip install bioservices