Question

I'm using Scrapy to extract data from certain websites. The problem is that my spider can only crawl the pages in the initial start_urls; it can't crawl the URLs found inside those pages. Here is the spider, copied exactly:

    from scrapy.spider import BaseSpider
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import HtmlXPathSelector
    from scrapy.http import Request
    from scrapy.utils.response import get_base_url
    from scrapy.utils.url import urljoin_rfc
    from nextlink.items import NextlinkItem

    class Nextlink_Spider(BaseSpider):
        name = "Nextlink"
        allowed_domains = ["Nextlink"]
        start_urls = ["http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"]

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            sites = hxs.select('//body/div[2]/div[3]/div/ul/li[2]/a/@href')
            for site in sites:
                relative_url = site.extract()
                url = self._urljoin(response, relative_url)
                yield Request(url, callback=self.parsetext)

        def parsetext(self, response):
            log = open("log.txt", "a")
            log.write("test if the parsetext is called")
            hxs = HtmlXPathSelector(response)
            items = []
            texts = hxs.select('//div').extract()
            for text in texts:
                item = NextlinkItem()
                item['text'] = text
                items.append(item)
                log = open("log.txt", "a")
                log.write(text)
            return items

        def _urljoin(self, response, url):
            """Helper to convert relative urls to absolute"""
            return urljoin_rfc(response.url, url, response.encoding)

I use log.txt to test whether parsetext is called. However, after I ran my spider, there is nothing in log.txt.

Solution

See here:

http://readthedocs.org/docs/scrapy/en/latest/topics/spiders.html?highlight=allowed_domains#scrapy.spider.BaseSpider.allowed_domains

allowed_domains

An optional list of strings containing domains that this spider is allowed to crawl. Requests for URLs not belonging to the domain names specified in this list won’t be followed if OffsiteMiddleware is enabled.

So, as long as you haven't activated the OffsiteMiddleware in your settings, the value doesn't matter and you can leave allowed_domains out completely.

Check settings.py to see whether the OffsiteMiddleware is activated. It shouldn't be activated if you want to allow your spider to crawl on any domain.
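For illustration, disabling it could look roughly like this in settings.py (a minimal sketch; the import path depends on the Scrapy version: older releases, like the one used in the question, ship it as scrapy.contrib.spidermiddleware.offsite.OffsiteMiddleware, while newer ones use scrapy.spidermiddlewares.offsite.OffsiteMiddleware):

    # settings.py -- a minimal sketch assuming the old scrapy.contrib path
    # from the question; adjust the path for newer Scrapy versions.
    SPIDER_MIDDLEWARES = {
        # Mapping a middleware to None disables it, so requests are no longer
        # filtered against allowed_domains.
        'scrapy.contrib.spidermiddleware.offsite.OffsiteMiddleware': None,
    }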

Other tips

I think the problem is that you didn't tell Scrapy to follow each crawled URL. For my own blog I've implemented a CrawlSpider that uses LinkExtractor-based Rules to extract all relevant links from my blog pages:

    # -*- coding: utf-8 -*-

    '''
    *   This program is free software: you can redistribute it and/or modify
    *   it under the terms of the GNU General Public License as published by
    *   the Free Software Foundation, either version 3 of the License, or
    *   (at your option) any later version.
    *
    *   This program is distributed in the hope that it will be useful,
    *   but WITHOUT ANY WARRANTY; without even the implied warranty of
    *   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    *   GNU General Public License for more details.
    *
    *   You should have received a copy of the GNU General Public License
    *   along with this program.  If not, see <http://www.gnu.org/licenses/>.
    *
    *   @author Marcel Lange <info@ask-sheldon.com>
    *   @package ScrapyCrawler
    '''


    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    import Crawler.settings
    from Crawler.items import PageCrawlerItem


    class SheldonSpider(CrawlSpider):
        name = Crawler.settings.CRAWLER_NAME
        allowed_domains = Crawler.settings.CRAWLER_DOMAINS
        start_urls = Crawler.settings.CRAWLER_START_URLS
        rules = (
            Rule(
                LinkExtractor(
                    allow_domains=Crawler.settings.CRAWLER_DOMAINS,
                    allow=Crawler.settings.CRAWLER_ALLOW_REGEX,
                    deny=Crawler.settings.CRAWLER_DENY_REGEX,
                    restrict_css=Crawler.settings.CSS_SELECTORS,
                    canonicalize=True,
                    unique=True
                ),
                follow=True,
                callback='parse_item',
                process_links='filter_links'
            ),
        )

        # Filter links with the nofollow attribute
        def filter_links(self, links):
            return_links = list()
            if links:
                for link in links:
                    if not link.nofollow:
                        return_links.append(link)
                    else:
                        self.logger.debug('Dropped link %s because nofollow attribute was set.' % link.url)
            return return_links

        def parse_item(self, response):
            # self.logger.info('Parsed URL: %s with STATUS %s', response.url, response.status)
            item = PageCrawlerItem()
            item['status'] = response.status
            item['title'] = response.xpath('//title/text()')[0].extract()
            item['url'] = response.url
            item['headers'] = response.headers
            return item
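The Crawler.settings constants referenced above are not part of Scrapy; they live in the author's own settings module, which isn't shown here. A hypothetical sketch of what such a module might contain, with purely illustrative values:

    # Crawler/settings.py -- hypothetical sketch; these names mirror the
    # constants the spider reads, but the values are made up for illustration.
    CRAWLER_NAME = 'Sheldon'
    CRAWLER_DOMAINS = ['www.ask-sheldon.com']
    CRAWLER_START_URLS = ['https://www.ask-sheldon.com/']
    CRAWLER_ALLOW_REGEX = [r'.*']                            # follow everything by default
    CRAWLER_DENY_REGEX = [r'/wp-admin/', r'\?replytocom=']   # skip backend and reply links
    CSS_SELECTORS = ['#content', '.main-navigation']         # page regions to extract links from

With values like these in place, the crawler is started as usual with scrapy crawl followed by the spider name.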

On https://www.ask-sheldon.com/build-a-website-crawler-using-scrapy-framework/ I've described in detail how I implemented a website crawler to warm up my WordPress full-page cache.

My guess would be this line:

allowed_domains = ["Nextlink"]

This isn't a domain like domain.tld, so it would reject any extracted links. Take the example from the documentation instead: allowed_domains = ["dmoz.org"].
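Applied to the spider from the question, that would mean changing only the allowed_domains line so it matches the domain of the start URL, roughly like this:

    class Nextlink_Spider(BaseSpider):
        name = "Nextlink"
        # "dmoz.org" matches the start URL, so the OffsiteMiddleware no longer
        # filters out the follow-up requests yielded from parse().
        allowed_domains = ["dmoz.org"]
        start_urls = ["http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"]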

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow