Question

I'm very new to Python, Scrapy and Selenium. Thus, any help you could provide would be most appreciated.

I'd like to be able to take the HTML I've obtained from Selenium as the page source and process it into a Scrapy Response object. The main reason is to be able to add the URLs found in the Selenium WebDriver page source to the list of URLs Scrapy will crawl.
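In other words, I'd like to end up with something along these lines (HtmlResponse looks like the relevant Scrapy class, but I'm not sure this is the right approach, and fake_response is just an illustrative name):

from scrapy.http import HtmlResponse
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("https://www.abcdef.com/page/12345")

# wrap the Selenium page source in a Scrapy response object so it can be
# parsed with the usual Scrapy selectors
fake_response = HtmlResponse(
    url=driver.current_url,
    body=driver.page_source,
    encoding='utf-8',
)

driver.close()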

Again, any help would be appreciated.

As a quick second question, does anyone know how to view the list of URLs that Scrapy has found and scheduled to scrape (or has already scraped)?

Thanks!

*******EDIT******* Here is an example of what I am trying to do. I can't figure out part 5.

from scrapy.selector import Selector
from scrapy.spiders import CrawlSpider
from selenium import webdriver


class AB_Spider(CrawlSpider):
    name = "ab_spider"
    allowed_domains = ["abcdef.com"]
    #start_urls = ["https://www.kickstarter.com/projects/597507018/pebble-e-paper-watch-for-iphone-and-android"
    #, "https://www.kickstarter.com/projects/801465716/03-leagues-under-the-sea-the-seaquestor-flyer-subm"]
    start_urls = ["https://www.abcdef.com/page/12345"]

    def parse_abcs(self, response):
        sel = Selector(response)
        URL = response.url
        hxs = sel  # fall back to the original response if Selenium is not needed

        # Part 1: check if a certain element is on the webpage
        last_chk = sel.xpath('//ul/li[@last_page="true"]')
        a_len = len(last_chk)

        # Part 2: if not, get the page via the Selenium webdriver
        if a_len == 0:
            # OPEN WEBDRIVER AND GET PAGE
            driver = webdriver.Firefox()
            driver.get(response.url)

            # Part 3: run a script to interact with the page until a certain element appears
            while a_len == 0:
                print("ELEMENT NOT FOUND, USING SELENIUM TO GET THE WHOLE PAGE")

                # scroll down one time
                driver.execute_script("window.scrollTo(0, 1000000000);")

                # get the page source and check if the last page is there
                selen_html = driver.page_source
                hxs = Selector(text=selen_html)
                last_chk = hxs.xpath('//ul/li[@last_page="true"]')
                a_len = len(last_chk)

            driver.close()

        # Part 4: extract the URLs from the Selenium webdriver page source
        all_URLS = hxs.xpath('//a/@href').extract()

        # Part 5: add all_URLS to the Scrapy URLs to be scraped

Solution

Just yield Request instances from the method and provide a callback:

from scrapy import Request


class AB_Spider(CrawlSpider):
    ...

    def parse_abcs(self, response):
        ...

        all_URLS = hxs.xpath('//a/@href').extract()

        for url in all_URLS:
            yield Request(url, callback=self.parse_page)

    def parse_page(self, response):
        # Do the parsing here
        pass
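One thing to watch out for: the hrefs extracted from the Selenium page source may be relative, and Request needs absolute URLs. A rough sketch of handling that (it assumes hxs was built from driver.page_source as in the question and would replace the loop above):

from urllib.parse import urljoin

from scrapy import Request

for href in hxs.xpath('//a/@href').extract():
    # turn relative hrefs into absolute URLs before queueing the requests
    yield Request(urljoin(response.url, href), callback=self.parse_page)

As for the second question, the crawl log already lists every URL Scrapy fetches (the "Crawled (200) <GET ...>" lines), and if you need the list programmatically you can collect response.url in each callback, for example into a set stored on the spider.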