Scrapy - Follow RSS links

https://stackoverflow.com/questions/2939050

05-10-2019

Question

I was wondering if anyone has ever tried to extract and follow RSS item links using SgmlLinkExtractor/CrawlSpider. I can't get it to work...

I am using the following rule:

   rules = (
       Rule(SgmlLinkExtractor(tags=('link',), attrs=False),
           follow=True,
           callback='parse_article'),
       )

(bearing in mind that RSS links are located in the <link> tag).

I am not sure how to tell SgmlLinkExtractor to extract the text() of the link instead of searching its attributes...

Any help is welcome. Thanks in advance.

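Solution

CrawlSpider rules don't work that way: link extractors read URLs from tag attributes, while RSS puts the URL in the text of the <link> element. You'll probably need to subclass BaseSpider and implement your own link extraction in your spider callback. For example: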

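from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.selector import XmlXPathSelector

class MySpider(BaseSpider):
    name = 'myspider'
    start_urls = ['http://blog.scrapy.org/rss.xml']  # the feed to crawl

    def parse(self, response):
        # In RSS the URL is the text of <link>, not an attribute
        xxs = XmlXPathSelector(response)
        links = xxs.select("//link/text()").extract()
        return [Request(x, callback=self.parse_link) for x in links]

    def parse_link(self, response):
        # Called once per followed link; extract the article here
        pass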

You can also try the XPath in the shell, by running for example:

scrapy shell http://blog.scrapy.org/rss.xml

And then typing in the shell:

>>> xxs.select("//link/text()").extract()
[u'http://blog.scrapy.org',
 u'http://blog.scrapy.org/new-bugfix-release-0101',
 u'http://blog.scrapy.org/new-scrapy-blog-and-scrapy-010-release']

Other tips

There's an XMLFeedSpider one can use nowadays.

I have done it using CrawlSpider:

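from scrapy.contrib.spiders import CrawlSpider
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector, XmlXPathSelector

class MySpider(CrawlSpider):
   domain_name = "xml.example.com"

   # Note: overriding parse() bypasses CrawlSpider's rules
   def parse(self, response):
       xxs = XmlXPathSelector(response)
       items = xxs.select('//channel/item')
       for i in items:
           # each <item> carries its URL as the text of <link>
           urli = i.select('link/text()').extract()
           request = Request(url=urli[0], callback=self.parse1)
           yield request

   def parse1(self, response):
       hxs = HtmlXPathSelector(response)
       # ...
       yield MyItem()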

but I am not sure that is a very proper solution...
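
One difference from the solution above is worth spelling out: //link/text() also matches the channel's own <link> (the first entry in the shell output), while //channel/item narrows the selection to feed entries. A minimal standard-library sketch of that difference, outside Scrapy and using a made-up feed (SAMPLE_FEED, all_links and item_links are illustrative names, not from the original post):

```python
import xml.etree.ElementTree as ET

# Made-up feed for illustration only.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example blog</title>
    <link>http://example.com</link>
    <item>
      <title>First post</title>
      <link>http://example.com/first-post</link>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/second-post</link>
    </item>
  </channel>
</rss>"""


def all_links(feed_xml):
    """Text of every <link> element, like //link/text()."""
    root = ET.fromstring(feed_xml)
    return [el.text.strip() for el in root.iter("link") if el.text]


def item_links(feed_xml):
    """Text of <link> elements inside <item> only."""
    root = ET.fromstring(feed_xml)
    return [el.text.strip()
            for el in root.findall("./channel/item/link") if el.text]


print(all_links(SAMPLE_FEED))   # includes the channel-level link
print(item_links(SAMPLE_FEED))  # only the per-item links
```

Following only the item links avoids requesting the site's home page as if it were an article.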

XMLFeedSpider example from the Scrapy docs:

from scrapy.spiders import XMLFeedSpider
# from myproject.items import TestItem  # only needed if TestItem is used below

class MySpider(XMLFeedSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/feed.xml']
    iterator = 'iternodes'  # This is actually unnecessary, since it's the default value
    itertag = 'item'

    def parse_node(self, response, node):
        self.logger.info('Hi, this is a <%s> node!: %s', self.itertag, ''.join(node.extract()))

        # item = TestItem()
        item = {}  # plain dict, which avoids the "class not found" error
        item['id'] = node.xpath('@id').extract()
        item['name'] = node.xpath('name').extract()
        item['description'] = node.xpath('description').extract()
        return item

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow