Question

I am trying to get SgmlLinkExtractor to work.

This is the signature:

SgmlLinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), restrict_xpaths=(), tags=('a', 'area'), attrs=('href',), canonicalize=True, unique=True, process_value=None)

I am just using allow=()

So, I enter

rules = (Rule(SgmlLinkExtractor(allow=("/aadler/", )), callback='parse'),)

The initial URL is 'http://www.whitecase.com/jacevedo/', and with allow=('/aadler/',) I expect that '/aadler/' will get scanned as well. Instead, the spider scans only the initial URL and then closes:

[wcase] INFO: Domain opened
[wcase] DEBUG: Crawled </jacevedo/> (referer: <None>)
[wcase] INFO: Passed NuItem(school=[u'JD, ', u'Columbia Law School, Harlan Fiske Stone Scholar, Parker School Recognition of Achievement in International and Foreign Law, ', u'2005'])
[wcase] INFO: Closing domain (finished)

What am I doing wrong here?

Is there anyone here who has used Scrapy successfully and can help me finish this spider?

Thank you for the help.

I include the code for the spider below:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item
from Nu.items import NuItem
from urls import u

class NuSpider(CrawlSpider):
    domain_name = "wcase"
    start_urls = ['xxxxxx/jacevedo/']

    rules = (Rule(SgmlLinkExtractor(allow=("/aadler/", )), callback='parse'),)

    def parse(self, response):
        hxs = HtmlXPathSelector(response)

        item = NuItem()
        item['school'] = hxs.select('//td[@class="mainColumnTDa"]').re('(?<=(JD,\s))(.*?)(\d+)')
        return item

SPIDER = NuSpider()

Note: SO will not let me post more than one URL, so substitute the initial URL as necessary. Sorry about that.


Solution

It appears you are overriding the parse method. In CrawlSpider, parse is not meant to be overridden: the class uses it internally to follow the links matched by your rules, so the rule's callback needs a different name.
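
For example, here is a minimal sketch of your spider with the callback renamed (parse_item is just an illustrative name; any name other than parse will do):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from Nu.items import NuItem

class NuSpider(CrawlSpider):
    domain_name = "wcase"
    start_urls = ['xxxxxx/jacevedo/']

    # the callback no longer shadows CrawlSpider.parse
    rules = (Rule(SgmlLinkExtractor(allow=(r'/aadler/', )), callback='parse_item'),)

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        item = NuItem()
        item['school'] = hxs.select('//td[@class="mainColumnTDa"]').re(r'(?<=(JD,\s))(.*?)(\d+)')
        return item

SPIDER = NuSpider()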

OTHER TIPS

If you check the documentation, a warning is clearly stated:

"When writing crawl spider rules, avoid using parse as callback, since the Crawl Spider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work."

(See the CrawlSpider documentation for this warning.)

Also, write the pattern as a raw string:

allow=(r'/aadler/', ...

You are also missing a comma after the first element, which is needed for "rules" to be a tuple:

rules = (Rule(SgmlLinkExtractor(allow=('/careers/n.\w+', )), callback='parse', follow=True),)
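
As a side note on the trailing comma: in Python, parentheses alone do not make a tuple; the comma does. A tiny illustration (not specific to Scrapy):

patterns = ("/aadler/")    # just a string in parentheses, not a tuple
patterns = ("/aadler/",)   # the trailing comma makes it a one-element tuple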
Licensed under: CC-BY-SA with attribution