Scrapy search form follows a nonexistent page

https://stackoverflow.com/questions/23152475

05-07-2023

Question

I'm trying to scrape the results for certain keywords using the advanced search form of The Independent.

import scrapy
from scrapy.http import FormRequest


class IndependentSpider(scrapy.Spider):
    name = "IndependentSpider"
    start_urls = ["http://www.independent.co.uk/advancedsearch"]

    def parse(self, response):
        # Fill in the search form found on the page and submit it.
        # Yield the request itself, not a list wrapping it.
        yield FormRequest.from_response(
            response, formdata={"all": "Science"}, callback=self.parse_results
        )

    def parse_results(self, response):
        # Print every <h3> element found on the results page.
        print(response.xpath("//h3").extract())

When I run the spider, the request is redirected:

DEBUG: Redirecting (301) to <GET http://www.independent.co.uk/ind/advancedsearch/> from <GET http://www.independent.co.uk/advancedsearch>

which is a page that doesn't seem to exist.

Do you know what I am doing wrong?

Thanks!

Answer

It seems you need a trailing /.

Try start_urls = ["http://www.independent.co.uk/advancedsearch/"]
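
If you have several start URLs, you could normalize them instead of editing each one by hand. Below is a minimal sketch using only the standard library; with_trailing_slash is a hypothetical helper name, not part of Scrapy, and it assumes the site expects a slash-terminated path on every URL you pass in.

```python
from urllib.parse import urlsplit, urlunsplit


def with_trailing_slash(url):
    """Return url with "/" appended to its path if it is missing.

    Hypothetical helper for illustration; not a Scrapy API.
    The query string and fragment, if any, are left untouched.
    """
    parts = urlsplit(url)
    path = parts.path if parts.path.endswith("/") else parts.path + "/"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


# The resulting list can be assigned to the spider's start_urls attribute.
start_urls = [with_trailing_slash("http://www.independent.co.uk/advancedsearch")]
print(start_urls)
```

A URL that already ends in a slash is returned unchanged, so applying the helper twice is harmless.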

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow