Question

I'm trying to get some results from UniProt, which is a protein database (the details are not important). I'm trying to use a script that translates from one kind of ID to another. I was able to do this manually in the browser, but could not do it in Python.

In http://www.uniprot.org/faq/28 there are some sample scripts. I tried the Perl one and it seems to work, so the problem lies in my Python attempts. The (working) script is:

## tool_example.pl ##
use strict;
use warnings;
use LWP::UserAgent;

my $base = 'http://www.uniprot.org';
my $tool = 'mapping';
my $params = {
  from => 'ACC', to => 'P_REFSEQ_AC', format => 'tab',
  query => 'P13368 P20806 Q9UM73 P97793 Q17192'
};

my $agent = LWP::UserAgent->new;
push @{$agent->requests_redirectable}, 'POST';
print STDERR "Submitting...\n";
my $response = $agent->post("$base/$tool/", $params);

while (my $wait = $response->header('Retry-After')) {
  print STDERR "Waiting ($wait)...\n";
  sleep $wait;
  print STDERR "Checking...\n";
  $response = $agent->get($response->base);
}

$response->is_success ?
  print $response->content :
  die 'Failed, got ' . $response->status_line . 
    ' for ' . $response->request->uri . "\n";

My questions are:

1) How would you do that in Python?

2) Will I be able to massively "scale" that (i.e., use a lot of entries in the query field)?


Solution

Question 1:

This can be done using Python's urllib and urllib2 modules:

import urllib, urllib2
import time
import sys

# skip sys.argv[0] (the script name); the remaining arguments are the query IDs
query = ' '.join(sys.argv[1:])

# encode params as a sequence of 2-tuples
params = (('from', 'ACC'), ('to', 'P_REFSEQ_AC'), ('format', 'tab'), ('query', query))
# url encode them
data = urllib.urlencode(params)
url = 'http://www.uniprot.org/mapping/'

# fetch the data
try:
    foo = urllib2.urlopen(url, data)
except urllib2.HTTPError, e:
    if e.code == 503:
        # the service is busy; wait as long as the Retry-After header asks
        wait_time = int(e.hdrs.get('Retry-After', 0))
        print 'Sleeping %i seconds...' % (wait_time,)
        time.sleep(wait_time)
        foo = urllib2.urlopen(url, data)
    else:
        raise

# foo is a file-like object, do with it what you will.
print foo.read()
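
If you are on Python 3, urllib and urllib2 were merged into urllib.request, urllib.parse and urllib.error; a minimal sketch of the same request (not tested here against the current UniProt service) would be:

import sys
import time
import urllib.error
import urllib.parse
import urllib.request

query = ' '.join(sys.argv[1:])

params = {'from': 'ACC', 'to': 'P_REFSEQ_AC', 'format': 'tab', 'query': query}
data = urllib.parse.urlencode(params).encode('ascii')  # POST body must be bytes
url = 'http://www.uniprot.org/mapping/'

try:
    response = urllib.request.urlopen(url, data)
except urllib.error.HTTPError as e:
    if e.code == 503:
        # the service is busy; wait as long as the Retry-After header asks
        wait_time = int(e.headers.get('Retry-After', 0))
        print('Sleeping %i seconds...' % wait_time)
        time.sleep(wait_time)
        response = urllib.request.urlopen(url, data)
    else:
        raise

print(response.read().decode('utf-8'))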

OTHER TIPS

You're probably better off using the Protein Identifier Cross Reference service from the EBI to convert one set of IDs to another. It has a very good REST interface.

http://www.ebi.ac.uk/Tools/picr/

I should also mention that UniProt has very good web services available, though if you are tied to using simple HTTP requests for some reason then it's probably not useful.

Let's assume that you are using Python 2.5. We can use httplib to directly call the web site:

import httplib, urllib

# Build the query string from the following keys (query, format, columns, compress, limit, offset)
querystring = {}
querystring["query"] = ""
querystring["format"] = ""   # one of html | tab | fasta | gff | txt | xml | rdf | rss | list
querystring["columns"] = ""  # the columns you want, comma separated
querystring["compress"] = "" # yes or no
## These may be optional
querystring["limit"] = ""    # I guess if you only want a few rows
querystring["offset"] = ""   # bring on paging

## From the examples - query=organism:9606+AND+antigen&format=xml&compress=no
## Delete the following and replace with your own query
querystring = {}
querystring["query"] = "organism:9606 AND antigen"
querystring["format"] = "xml"   # make it human readable
querystring["compress"] = "no"  # I don't want to have to unzip

conn = httplib.HTTPConnection("www.uniprot.org")
conn.request("GET", "/uniprot/?" + urllib.urlencode(querystring))
r1 = conn.getresponse()
if r1.status == 200:
    data1 = r1.read()
    print data1  # or do something with it

You could then wrap the query-string building and the request in a function, as in the sketch below, and you should be away.
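
For example, something along these lines (a sketch only; the helper name and its defaults are made up here, and it reuses the same httplib/urllib calls as above):

import httplib, urllib

def query_uniprot(query, format="tab", **extra):
    """Hypothetical helper: run a UniProt search and return the raw response body."""
    querystring = {"query": query, "format": format}
    querystring.update(extra)  # e.g. columns, compress, limit, offset
    conn = httplib.HTTPConnection("www.uniprot.org")
    conn.request("GET", "/uniprot/?" + urllib.urlencode(querystring))
    response = conn.getresponse()
    if response.status != 200:
        raise Exception("UniProt returned %d %s" % (response.status, response.reason))
    return response.read()

print query_uniprot("organism:9606 AND antigen", format="xml", compress="no")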

Check out bioservices; it interfaces with a lot of databases through Python. https://pythonhosted.org/bioservices/_modules/bioservices/uniprot.html

conda install bioservices --yes

In complement to O.rka's answer:

Question 1:

from bioservices import UniProt
u = UniProt()
res = u.get_df("P13368 P20806 Q9UM73 P97793 Q17192".split())

This returns a dataframe with all information about each entry.

Question 2: same answer. This should scale up; for very long ID lists you can split the query into chunks, as in the sketch below.
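
For instance, a rough sketch of chunking (the chunk size of 100 is an arbitrary choice here, not a documented limit):

import pandas as pd
from bioservices import UniProt

u = UniProt()
ids = "P13368 P20806 Q9UM73 P97793 Q17192".split()  # imagine thousands of IDs here

# query in chunks and concatenate the resulting dataframes
chunks = [ids[i:i + 100] for i in range(0, len(ids), 100)]
res = pd.concat([u.get_df(chunk) for chunk in chunks], ignore_index=True)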

Disclaimer: I'm the author of bioservices

There is a Python package on PyPI that does exactly what you want:

pip install uniprot-mapper