Question

I am trying to crawl PubMed with Python and get the PubMed IDs of all papers that cite a given article.

For example, this article (ID: 11825149), http://www.ncbi.nlm.nih.gov/pubmed/11825149, has a page linking to all the articles that cite it: http://www.ncbi.nlm.nih.gov/pubmed?linkname=pubmed_pubmed_citedin&from_uid=11825149. The problem is that it has over 200 links but shows only 20 per page, and the 'next page' link is not accessible by URL.

Is there a way to open the 'send to' option, or to view the content of the following pages, with Python?

How I currently open pubmed pages:

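from urllib.request import urlopen  # Python 3; in Python 2 this was urllib2.urlopen

def start(seed, pageid):
    # Fetch and print the article page itself.
    webpage = urlopen(seed).read()
    print(webpage)

    # Fetch and print the first page of "cited by" results for this PubMed ID.
    # pageid is the PubMed ID as a string, e.g. '11825149'.
    citedByPage = urlopen('http://www.ncbi.nlm.nih.gov/pubmed?linkname=pubmed_pubmed_citedin&from_uid=' + pageid).read()
    print(citedByPage)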

From this I can extract all the cited by links on the first page, but how can I extract them from all pages? Thanks.


Solution

I was able to get the cited-by IDs using the method described on this page, quoted below: http://www.bio-cloud.info/Biopython/en/ch8.html

Back in Section 8.7 we mentioned ELink can be used to search for citations of a given paper. Unfortunately this only covers journals indexed for PubMed Central (doing it for all the journals in PubMed would mean a lot more work for the NIH). Let’s try this for the Biopython PDB parser paper, PubMed ID 14630660:

>>> from Bio import Entrez
>>> Entrez.email = "A.N.Other@example.com"
>>> pmid = "14630660"
>>> results = Entrez.read(Entrez.elink(dbfrom="pubmed", db="pmc",
...                                    LinkName="pubmed_pmc_refs", from_uid=pmid))
>>> pmc_ids = [link["Id"] for link in results[0]["LinkSetDb"][0]["Link"]]
>>> pmc_ids
['2744707', '2705363', '2682512', ..., '1190160']

Great - eleven articles. But why hasn’t the Biopython application note been found (PubMed ID 19304878)? Well, as you might have guessed from the variable names, these are not actually PubMed IDs, but PubMed Central IDs. Our application note is the third citing paper in that list, PMCID 2682512.

So, what if (like me) you’d rather get back a list of PubMed IDs? Well, we can call ELink again to translate them. This becomes a two-step process, so by now you should expect to use the history feature to accomplish it (Section 8.15).

But first, taking the more straightforward approach of making a second (separate) call to ELink:

>>> results2 = Entrez.read(Entrez.elink(dbfrom="pmc", db="pubmed", LinkName="pmc_pubmed",
...                                     from_uid=",".join(pmc_ids)))
>>> pubmed_ids = [link["Id"] for link in results2[0]["LinkSetDb"][0]["Link"]]
>>> pubmed_ids
['19698094', '19450287', '19304878', ..., '15985178']

This time you can immediately spot the Biopython application note as the third hit (PubMed ID 19304878).
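
For reuse, the two separate ELink calls above could be wrapped roughly as follows. This is only a sketch: the function names `extract_link_ids` and `citing_pubmed_ids` are made up for illustration, and the code assumes that `LinkSetDb` is empty or missing when nothing links to a record. It also passes the lowercase `linkname` and `id` parameter names from the E-utilities documentation instead of the `LinkName` and `from_uid` spellings in the quoted session, which is a substitution on my part rather than something the tutorial states.

```python
def extract_link_ids(elink_results):
    """Collect linked IDs from a parsed ELink result.

    Expects the structure used in the session above:
    results[i]["LinkSetDb"][j]["Link"][k]["Id"].
    """
    ids = []
    for linkset in elink_results:
        # Assumption: "LinkSetDb" is empty or absent when there are no links.
        for linkset_db in linkset.get("LinkSetDb", []):
            ids.extend(link["Id"] for link in linkset_db["Link"])
    return ids


def citing_pubmed_ids(pmid, email):
    """PubMed ID -> citing PMC IDs -> their PubMed IDs (needs network access)."""
    # Imported here so extract_link_ids can be used without Biopython installed.
    from Bio import Entrez

    Entrez.email = email
    step1 = Entrez.read(Entrez.elink(dbfrom="pubmed", db="pmc",
                                     linkname="pubmed_pmc_refs", id=pmid))
    pmc_ids = extract_link_ids(step1)
    if not pmc_ids:
        return []  # nothing in PubMed Central cites this paper

    step2 = Entrez.read(Entrez.elink(dbfrom="pmc", db="pubmed",
                                     linkname="pmc_pubmed",
                                     id=",".join(pmc_ids)))
    return extract_link_ids(step2)
```

Calling `citing_pubmed_ids("14630660", "A.N.Other@example.com")` should then give a list like `pubmed_ids` above, subject to the same PubMed Central coverage limit.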

Now, let’s do that all again but with the history …TODO.

And finally, don’t forget to include your own email address in the Entrez calls.
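
If you are not using Biopython, the same kind of request can be built by hand, and the email address then travels as a plain query parameter. The sketch below only constructs the URL. It assumes that the `pubmed_pubmed_citedin` link name from the question's URL is also accepted by the E-utilities ELink endpoint; the helper name `cited_in_url` and the `tool` value are placeholders.

```python
from urllib.parse import urlencode

EUTILS_ELINK = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"


def cited_in_url(pmid, email, tool="citedin_example"):
    """Build an ELink URL asking PubMed which records cite `pmid`."""
    params = {
        "dbfrom": "pubmed",
        "db": "pubmed",
        # Link name taken from the question's URL (assumed valid for ELink).
        "linkname": "pubmed_pubmed_citedin",
        "id": str(pmid),
        "email": email,  # identifies you to NCBI
        "tool": tool,    # placeholder tool name; choose your own
    }
    return EUTILS_ELINK + "?" + urlencode(params)


print(cited_in_url("11825149", "A.N.Other@example.com"))
```

Because ELink returns its links in a single response, this route would sidestep the 20-per-page limit of the web interface, if the assumption about the link name holds.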
Licensed under: CC-BY-SA with attribution