How to fetch and parse all existing pages based on site pager? [closed]

https://stackoverflow.com/questions/17740868

03-06-2022

Question

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.

Questions asking for code must demonstrate a minimal understanding of the problem being solved. Include attempted solutions, why they didn't work, and the expected results. See also: Stack Overflow question checklist

Closed 8 years ago.

Could somebody provide code or examples for fetching and parsing every page reachable through a site's pager?

Example HTML:

...
<dd><span class="active">1</span></dd>
<dd><a href="http://rabota.ua/jobsearch/vacancy_list?regionId=1&amp;pg=2">2</a></dd>
<dd><a href="http://rabota.ua/jobsearch/vacancy_list?regionId=1&amp;pg=3">3</a></dd>
<dd><a href="http://rabota.ua/jobsearch/vacancy_list?regionId=1&amp;pg=4">4</a></dd>
<dd><a href="http://rabota.ua/jobsearch/vacancy_list?regionId=1&amp;pg=5">5</a></dd>
<dd><a href="http://rabota.ua/jobsearch/vacancy_list?regionId=1&amp;pg=6">6</a></dd>
<dd style="position: absolute; right: 50px;">
<a id="centerZone_vacancyList_gridList_linkNext" href="http://rabota.ua/jobsearch/vacancy_list?regionId=1&amp;pg=2">next »</a>
...

I'd like to crawl the links to obtain one big list of existing vacancies as JSON or XML.

Solution

The site has a sitemap, which is probably easier to use than following the pager links.

Scrapy's SitemapSpider can crawl it.
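A sitemap is a flat XML list of page URLs, which is why it tends to be simpler than walking a pager. As a rough illustration of the kind of document a sitemap crawler consumes, here is a standard-library sketch that extracts the URLs from one. The sample XML, the example.com URLs, and the `extract_sitemap_urls` helper are made up for this example; the answer does not say where the real sitemap lives or what it contains.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sample; a real sitemap would be downloaded from the site.
SAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/vacancy/1</loc></url>
  <url><loc>http://example.com/vacancy/2</loc></url>
</urlset>"""


def extract_sitemap_urls(xml_text):
    """Return every <loc> URL listed in a sitemap document."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)]


print(extract_sitemap_urls(SAMPLE_SITEMAP))
```

With Scrapy itself, a SitemapSpider subclass handles this step once its `sitemap_urls` attribute lists the sitemap location, so no manual XML parsing should be needed.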

Other suggestions

Fortunately, I found a solution. I hope it will be helpful for others:

import scrapy
from scrapy_sample.items import ScrapySampleItem


class ScrapyOrgSpider(scrapy.Spider):
    name = "scrapy"
    allowed_domains = ["scrapy.org"]
    start_urls = ["http://blog.scrapy.org/"]

    def parse(self, response):
        # Follow the pager's "next" link, if present, and parse that page
        # with this same callback.
        next_page = response.xpath(
            "//div[@class='pagination']/a[@class='next_page']/@href"
        ).extract()
        if next_page:
            # urljoin resolves a relative href against the current page URL.
            yield scrapy.Request(response.urljoin(next_page[0]), callback=self.parse)

        # Yield one item per post found on the current page.
        for post in response.xpath("//div[@class='post']"):
            item = ScrapySampleItem()
            item["title"] = post.xpath("div[@class='bodytext']/h2/a/text()").extract()
            item["link"] = post.xpath("div[@class='bodytext']/h2/a/@href").extract()
            item["content"] = post.xpath("div[@class='bodytext']/p/text()").extract()
            yield item
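The spider works by following the "next" link from each page until a page has none. The same control flow can be sketched without Scrapy, using only the standard library and the pager markup from the question. The in-memory `FAKE_PAGES` site, the `fetch` callable, and the `crawl_all_pages` helper are assumptions made for illustration; only the anchor id and the URL pattern come from the question's HTML.

```python
import json
from html.parser import HTMLParser

# id of the "next" anchor, taken from the question's HTML.
NEXT_LINK_ID = "centerZone_vacancyList_gridList_linkNext"
BASE = "http://rabota.ua/jobsearch/vacancy_list?regionId=1"

# Hypothetical in-memory "site": three pages, the last one has no next link.
FAKE_PAGES = {
    BASE: '<a id="%s" href="%s&amp;pg=2">next</a>' % (NEXT_LINK_ID, BASE),
    BASE + "&pg=2": '<a id="%s" href="%s&amp;pg=3">next</a>' % (NEXT_LINK_ID, BASE),
    BASE + "&pg=3": "<p>last page</p>",
}


class NextLinkParser(HTMLParser):
    """Record the href of the pager's "next" anchor, if the page has one."""

    def __init__(self):
        super().__init__()
        self.next_url = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("id") == NEXT_LINK_ID:
            # HTMLParser has already unescaped &amp; to & in attribute values.
            self.next_url = attrs.get("href")


def crawl_all_pages(start_url, fetch):
    """Follow "next" links from start_url until a page has none."""
    visited = []
    url = start_url
    while url and url not in visited:  # guard against a pager that loops
        visited.append(url)
        parser = NextLinkParser()
        parser.feed(fetch(url))
        url = parser.next_url
    return visited


pages = crawl_all_pages(BASE, FAKE_PAGES.__getitem__)
print(json.dumps(pages, indent=2))
```

In a real crawl, `fetch` would download the page and each visited page would also be parsed for vacancies. With the Scrapy spider, running `scrapy crawl scrapy -o items.json` should write the collected items as one JSON list.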

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow