Question

This is what I want to do. I have a list of gene names, for example: [ITGB1, RELA, NFKBIA]

After reading the Biopython help and the tutorial for the Entrez API, I came up with this:

x = ['ITGB1', 'RELA', 'NFKBIA']
for item in x:
    handle = Entrez.efetch(db="nucleotide", id=item ,rettype="gb")
    record = handle.read()
    out_handle = open('genes/'+item+'.xml', 'w') #to create a file with gene name
    out_handle.write(record)
    out_handle.close()

But this keeps erroring out. I have discovered that it works if the id is a numerical ID (although you have to pass it as a string, e.g. '186972394'):

handle = Entrez.efetch(db="nucleotide", id='186972394' ,rettype="gb")

This gets me the info I want which includes the sequence.

So now to the question: how can I search by gene name (because I do not have ID numbers), or easily convert my gene names to IDs, so that I can get the sequences for my gene list?

Thank you,


Solution

First, start with the gene name, e.g. ATK1:

item = 'ATK1'
animal = 'Homo sapiens'
search_string = item+"[Gene] AND "+animal+"[Organism] AND mRNA[Filter] AND RefSeq[Filter]"

Now we have a search string to search for IDs:

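from Bio import Entrez
Entrez.email = "you@example.com"  # placeholder; NCBI asks for a contact address

handle = Entrez.esearch(db="nucleotide", term=search_string)
record = Entrez.read(handle)
ids = record['IdList']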

This returns the IDs as a list; if no ID is found, the list is empty ([]). Now let's assume it returns one item in the list:

seq_id = ids[0]  # you must add an if to handle the 0 or >1 cases
handle = Entrez.efetch(db="nucleotide", id=seq_id, rettype="fasta", retmode="text")
record = handle.read()

This will give you a FASTA string, which you can save to a file:

with open('myfasta.fasta', 'w') as out_handle:  # closes the file automatically
    out_handle.write(record.rstrip('\n'))
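
Putting the steps above together for a whole gene list, a minimal sketch might look like the following. The helper names (`build_search_term`, `choose_id`, `fetch_fasta_for_genes`), the `genes` output directory and the placeholder email address are assumptions for illustration, not part of Biopython. Taking the first ID when several come back is only one possible policy.

```python
import os


def build_search_term(gene, organism="Homo sapiens"):
    """Build the same Entrez query string as above for one gene symbol."""
    return (f"{gene}[Gene] AND {organism}[Organism] "
            "AND mRNA[Filter] AND RefSeq[Filter]")


def choose_id(ids):
    """Return None when the search found nothing, otherwise the first ID."""
    if not ids:
        return None
    return ids[0]


def fetch_fasta_for_genes(genes, out_dir="genes", email="you@example.com"):
    """Search each gene, fetch one FASTA record, write one file per gene."""
    # Imported here so the two helpers above can be used without Biopython.
    from Bio import Entrez

    Entrez.email = email  # placeholder address; NCBI asks for a contact
    os.makedirs(out_dir, exist_ok=True)
    for gene in genes:
        handle = Entrez.esearch(db="nucleotide", term=build_search_term(gene))
        ids = Entrez.read(handle)["IdList"]
        handle.close()

        seq_id = choose_id(ids)
        if seq_id is None:
            print(f"{gene}: no hit, skipping")
            continue
        if len(ids) > 1:
            print(f"{gene}: {len(ids)} hits, using the first ({seq_id})")

        handle = Entrez.efetch(db="nucleotide", id=seq_id,
                               rettype="fasta", retmode="text")
        with open(os.path.join(out_dir, f"{gene}.fasta"), "w") as out_handle:
            out_handle.write(handle.read())
        handle.close()
```

Calling `fetch_fasta_for_genes(['ITGB1', 'RELA', 'NFKBIA'])` needs Biopython installed and network access; the two small helpers can be tried on their own without either.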

Other tips

Looking at section 8.3 of the tutorial, there appears to be a function that will allow you to search for terms and get the corresponding IDs (I know nothing about this library and even less about biology, so this will potentially be completely wrong :) ).

>>> handle = Entrez.esearch(db="nucleotide",term="Cypripedioideae[Orgn] AND matK[Gene]")
>>> record = Entrez.read(handle)
>>> record["Count"]
'25'
>>> record["IdList"]
['126789333', '37222967', '37222966', '37222965', ..., '61585492']
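
One thing to watch in output like this: ESearch returns at most 20 IDs per request unless `retmax` is raised, so `IdList` can be shorter than `Count`. A small check along these lines can flag that. The helper name `is_truncated` is hypothetical, and the dictionary below only stands in for a parsed search result.

```python
def is_truncated(record):
    """True if the search matched more records than IdList contains."""
    # Count comes back as a string, so convert it before comparing.
    return int(record["Count"]) > len(record["IdList"])


# Stand-in for what Entrez.read() returns: 25 matches, but only 20 IDs listed.
fake_record = {"Count": "25", "IdList": [str(n) for n in range(20)]}
print(is_truncated(fake_record))  # True
```

If the check returns True, repeating the search with a larger `retmax` keyword argument to `Entrez.esearch` (for example `retmax=100`) should bring back the remaining IDs.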

From what I can tell, id refers to an actual ID number as returned by the esearch function (in the IdList attribute of the response). However if you use the term keyword, you can instead run a search and get the IDs of the matched items. Totally untested, but assuming the search supports boolean operators (it looks like AND works), you could try using a query like:

>>> handle = Entrez.esearch(db="nucleotide",term="ITGB1[Gene] OR RELA[Gene] OR NFKBIA[Gene]")
>>> record = Entrez.read(handle)
>>> record["IdList"]
# Hopefully your ids here...

To generate the term to insert, you could do something like this:

In [1]: l = ['ITGB1', 'RELA', 'NFKBIA']

In [2]: ' OR '.join('%s[Gene]' % i for i in l)
Out[2]: 'ITGB1[Gene] OR RELA[Gene] OR NFKBIA[Gene]'

The record["IdList"] could then be converted into a comma-delimited string and passed to the id argument in your original query by using something like:

In [3]: r = ['1234', '5678', '91011']

In [4]: ids = ','.join(r)

In [5]: ids
Out[5]: '1234,5678,91011'
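
If several IDs are fetched in one call with `rettype="fasta"`, the response text holds one record after another. As a rough sketch, the records can be separated on their `>` header lines with plain string handling. `split_fasta` is a hypothetical helper and the sample text is made up; Biopython's `SeqIO.parse` would be the more robust choice in practice.

```python
def split_fasta(text):
    """Split multi-record FASTA text into (header, sequence) tuples."""
    records = []
    header, seq_lines = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(seq_lines)))
            header, seq_lines = line[1:], []
        elif header is not None:
            seq_lines.append(line)
    if header is not None:
        records.append((header, "".join(seq_lines)))
    return records


# Made-up two-record response, only for illustration.
sample = ">seq1 made-up record\nACGT\nTTGA\n\n>seq2 another one\nGGCC\n"
for header, sequence in split_fasta(sample):
    print(header, len(sequence))
```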
Licensed under: CC-BY-SA with attribution