Question

I'm looking for a programmatic way to get all the Uniprot ids (Swiss-Prot + TrEMBL) for a given species (e.g. all the Uniprot ids that end in _MOUSE).

One way to do it would be to decompress and parse the stream at uniprot

Such files are available only for a very small subset of all the species represented in the Uniprot DB. Hence, this solution is not a general one.

My question is: is there a general, and hopefully more efficient, way to do this? (By "more efficient" I mean basically that it does not require such decompressing and parsing.)

Basically I'm wondering if uniprot.org supports a url-based query where I can specify some species identifier (e.g. MOUSE or 10090), and maybe also some field name like UniprotID, and whose response would be a list of all the Uniprot IDs for that species.


Solution 2

You can do this using the REST API provided at uniprot.org; see the FAQ on retrieving entries via queries.

Most of the time you will want to use NCBI/UniProt taxonomy identifiers instead of species names, e.g. 10090 instead of "Mus musculus". Using IDs instead of strings is more likely to return the right thing.

Species concepts are getting a bit funny these days with more and more sequencing projects, so do pay attention to what you are getting and why.
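As a rough illustration of the URL-based query described in this answer, here is a minimal Python sketch. It assumes the current UniProt REST service exposes a `stream` endpoint at `rest.uniprot.org`, a query field named `organism_id`, and a `list` output format that returns one accession per line. Check these names against the UniProt documentation before relying on them. The helper names `build_query_url` and `fetch_accessions` are made up for this example.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: endpoint of the current UniProt REST service.
BASE_URL = "https://rest.uniprot.org/uniprotkb/stream"


def build_query_url(taxon_id, reviewed_only=False):
    """Build a query URL for all entries of one taxonomy identifier."""
    # Assumption: "organism_id" and "reviewed" are valid query fields.
    query = f"organism_id:{taxon_id}"
    if reviewed_only:
        query += " AND reviewed:true"  # restrict to Swiss-Prot
    # Assumption: format=list yields one accession per line.
    return BASE_URL + "?" + urlencode({"query": query, "format": "list"})


def fetch_accessions(taxon_id, reviewed_only=False):
    """Yield accessions one at a time (performs a network request)."""
    with urlopen(build_query_url(taxon_id, reviewed_only)) as response:
        for raw_line in response:
            line = raw_line.decode("utf-8").strip()
            if line:
                yield line
```

Calling `fetch_accessions(10090)` would then iterate over the mouse accessions (Swiss-Prot + TrEMBL) without downloading and parsing a flat file by hand.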

OTHER TIPS

I haven't looked at the idmapping file that you are talking about, but I've used the following file to get IDs for a given species: ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/docs/speindex.txt

Then I parse it like so:

#!/usr/bin/env perl
use strict;
use warnings;

my $spec = shift;
my $re = quotemeta $spec;

my @ids =();
while (<>) {
  if (/$re/../^$/) {
    chomp;
    next if ($_ eq $spec);  # skip species line
    s/^\s+//;               # remove leading spaces
    push @ids, split(/, ?/, $_);
  }
}

print $_."\n" foreach @ids;

Run it from the command line, e.g. for 'Mus musculus (Mouse)':

script.pl "Mus musculus (Mouse)" speindex.txt
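For readers who prefer Python, the same idea can be sketched roughly as follows. The file layout is assumed from the Perl script above (a species line, then indented comma-separated entry names, then a blank line), not from the file's documentation. The function name `ids_for_species` and the sample lines are hypothetical. Unlike the Perl version, this sketch matches the species line exactly and stops at the first blank line after it.

```python
def ids_for_species(lines, species):
    """Collect the entry names listed under one species heading."""
    ids = []
    in_block = False
    for line in lines:
        text = line.strip()
        if not in_block:
            # Keep scanning until the species heading is found.
            in_block = text == species
            continue
        if not text:
            break  # a blank line ends the species block
        # Entry names are comma-separated; drop empty pieces.
        ids.extend(part.strip() for part in text.split(",") if part.strip())
    return ids


# Hypothetical sample mirroring the layout the Perl script expects.
sample = [
    "Homo sapiens (Human)\n",
    "  A_HUMAN, B_HUMAN\n",
    "\n",
    "Mus musculus (Mouse)\n",
    "  A_MOUSE, B_MOUSE,\n",
    "  C_MOUSE\n",
    "\n",
]
mouse_ids = ids_for_species(sample, "Mus musculus (Mouse)")
```

With the real file, you would pass an open file handle for speindex.txt instead of the sample list.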

I hope this helps...Paul

If you do not want to use a flat file, you can use the BioServices Python package, which retrieves the information from the UniProt web site:

from bioservices import UniProt
u = UniProt()
results = u.search("organism:10090+and+reviewed:yes", columns="id,entry name", limit=2)
print(results)   

The results variable is a string that you need to parse. It contains the UniProt entries and their entry names. The previous command retrieves only 2 entries, but if you remove the limit=2 argument, you will get all of them.

For instance, to get all entry names, you would type:

results = u.search("organism:10090+and+reviewed:yes", columns="id,entry name")
entries = [x.split()[1] for x in results.strip().split("\n")[1:]]

This takes a few seconds to download the 17000 entries. If you remove "reviewed:yes", it takes about 30 seconds to a minute.
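If you want both columns rather than just the entry names, a slightly more defensive way to parse the returned string is sketched below. It assumes the result is tab-separated text with one header row, as the parsing code above implies. The function name `parse_entries` is made up, and the sample string is illustrative text, not live output from UniProt.

```python
import csv
import io


def parse_entries(tab_text):
    """Turn tab-separated search output into (entry, entry_name) pairs."""
    reader = csv.reader(io.StringIO(tab_text.strip()), delimiter="\t")
    next(reader, None)  # skip the header row, if any
    # Ignore malformed rows that lack the second column.
    return [(row[0], row[1]) for row in reader if len(row) >= 2]


# Illustrative sample in the assumed layout (not real query output).
sample = "Entry\tEntry name\nP00520\tABL1_MOUSE\nP01942\tHBA_MOUSE\n"
pairs = parse_entries(sample)
entry_names = [name for _, name in pairs]
```

Splitting on tabs rather than on arbitrary whitespace keeps the parsing correct if you later request columns whose values contain spaces.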

I hope this is helpful.

For installation with python 2.7, just type:

pip install bioservices
Licensed under: CC-BY-SA with attribution