Question

I am trying to scrape the search results of http://www.ncbi.nlm.nih.gov/pubmed. I gathered all the useful information from the first page, but I am having trouble navigating to the second page (the second page returns no results; some parameters in the request must be missing or wrong).

My code is:

import lxml.html
from scrapy import FormRequest, Selector, Spider

from ..items import PubmedItem  # adjust to your project's items module


class PubmedSpider(Spider):
    name = "pubmed"
    cur_page = 1
    max_page = 3
    start_urls = [
        "http://www.ncbi.nlm.nih.gov/pubmed/?term=cancer+toxic+drug"
    ]

    def parse(self, response):
        sel = Selector(response)
        pubmed_results = sel.xpath('//div[@class="rslt"]')
        #next_page_url = sel.xpath('//div[@id="gs_n"]//td[@align="left"]/a/@href').extract()[0]
        self.cur_page = self.cur_page + 1
        print('cur_page', '*' * 30, self.cur_page)

        form_data = {'term': 'cancer+drug+toxic+',
                     'EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Entrez_Pager.Page': 'results',
                     'email_subj': 'cancer+drug+toxic+-+PubMed',
                     'EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Entrez_Pager.CurrPage': str(self.cur_page),
                     'email_subj2': 'cancer+drug+toxic+-+PubMed',
                     'EntrezSystem2.PEntrez.DbConnector.LastQueryKey': '2',
                     'EntrezSystem2.PEntrez.DbConnector.Cmd': 'PageChanged',
                     'p%24a': 'EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Entrez_Pager.Page',
                     'p%24l': 'EntrezSystem2',
                     'p%24': 'pubmed',
                     }

        for pubmed_result in pubmed_results:
            item = PubmedItem()
            item['title'] = lxml.html.fromstring(pubmed_result.xpath('.//a')[0].extract()).text_content()
            item['link'] = pubmed_result.xpath('.//p[@class="title"]/a/@href').extract()[0]
            yield item

        # modify following lines: request the next page once, after the loop, not per result
        if self.cur_page < self.max_page:
            yield FormRequest("http://www.ncbi.nlm.nih.gov/pubmed/?term=cancer+toxic+drug",
                              formdata=form_data, callback=self.parse2, method="POST")

    def parse2(self, response):
        with open('response_html', 'w') as f:
            f.write(response.body)

Cookies are enabled in settings.py.


Solution

If you are searching NCBI for information, why not use the E-Utilities, which are designed for exactly this kind of programmatic access? This avoids the abuse notifications the site returns (perhaps that is what happened to your scraper as well).

I know the question is quite old, but somebody may stumble upon the same problem...

Your base URL would be: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=cancer+toxic+drug

You can find a description of the query parameters here: http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ESearch (including how to get more results per query and how to page through them).
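As a minimal sketch of that approach: the snippet below builds an ESearch URL using the documented `db`, `term`, `retstart`, and `retmax` parameters (where `retstart`/`retmax` replace your page navigation), and parses the kind of XML the endpoint returns. The sample response embedded here is illustrative only, with made-up IDs, so the snippet runs without a network call.

```python
from urllib.parse import urlencode
from xml.etree import ElementTree

BASE = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_esearch_url(term, retstart=0, retmax=20):
    """Build an ESearch URL; paging is done via retstart/retmax."""
    return BASE + "?" + urlencode({
        "db": "pubmed",
        "term": term,
        "retstart": retstart,  # index of the first result to return
        "retmax": retmax,      # number of results per request
    })

# "Page 2" with 20 results per page is simply retstart=20.
url = build_esearch_url("cancer toxic drug", retstart=20, retmax=20)

# ESearch answers with XML shaped like this (abbreviated, fake IDs):
# Count is the total number of hits, each Id is a PubMed ID.
sample = """<eSearchResult>
  <Count>2</Count>
  <IdList><Id>123456</Id><Id>789012</Id></IdList>
</eSearchResult>"""

root = ElementTree.fromstring(sample)
count = int(root.findtext("Count"))
pmids = [e.text for e in root.iter("Id")]
```

You would then fetch each PMID's record with EFetch, instead of scraping the HTML result pages.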

Using this API would also let you use other tools, and a newer Python 3 as well.

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow