They have a sitemap you can use, it is probably easier to use.
You can use the SitemapSpider.
문제
Could somebody provide code or examples regarding the subject?
Example HTML:
...
<dd><span class="active">1</span></dd>
<dd><a href="http://rabota.ua/jobsearch/vacancy_list?regionId=1&pg=2">2</a></dd>
<dd><a href="http://rabota.ua/jobsearch/vacancy_list?regionId=1&pg=3">3</a></dd>
<dd><a href="http://rabota.ua/jobsearch/vacancy_list?regionId=1&pg=4">4</a></dd>
<dd><a href="http://rabota.ua/jobsearch/vacancy_list?regionId=1&pg=5">5</a></dd>
<dd><a href="http://rabota.ua/jobsearch/vacancy_list?regionId=1&pg=6">6</a></dd>
<dd style="position: absolute; right: 50px;">
<a id="centerZone_vacancyList_gridList_linkNext" href="http://rabota.ua/jobsearch/vacancy_list?regionId=1&pg=2">next »</a>
...
I'd like to crawl the links to obtain one big list of existing vacancies as JSON or XML.
해결책
They have a sitemap you can use, it is probably easier to use.
You can use the SitemapSpider.
다른 팁
Fortunately I've found solution. Hope, it'll be helpful for others...
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http.request import Request
from scrapy_sample.items import ScrapySampleItem
class ScrapyOrgSpider(BaseSpider):
name = "scrapy"
allowed_domains = ["scrapy.org"]
start_urls = ["http://blog.scrapy.org/"]
def parse(self, response):
hxs = HtmlXPathSelector(response)
next_page =
hxs.select("//div[@class='pagination']/a[@class='next_page']/@href").extract()
if not not next_page:
yield Request(next_page[0], self.parse)
posts = hxs.select("//div[@class='post']")
items = []
for post in posts:
item = ScrapySampleItem()
item["title"] = post.select("div[@class='bodytext']/h2/a/text()").extract()
item["link"] = post.select("div[@class='bodytext']/h2/a/@href").extract()
item["content"] = post.select("div[@class='bodytext']/p/text()").extract()
items.append(item)
for item in items:
yield item
!!