Add a new Field to your SiteScraperItem class and set it to response.url in the parse() method.
Adding Scrapy request URL into Parsed Array
09-07-2023
Question
I'm using the below Scrapy code, which is fully functioning, to scrape data from a website. The scraper takes a text list of product IDs as input, each of which is combined into a URL on line 10. How can I add the current start_url as an additional element to my item array?
from scrapy.spider import Spider
from scrapy.selector import Selector
from site_scraper.items import SiteScraperItem

class MySpider(Spider):
    name = "product"
    allowed_domains = ["site.com"]
    url_list = open("productIDs.txt")
    base_url = "http://www.site.com/p/"
    start_urls = [base_url + url.strip() for url in url_list.readlines()]
    url_list.close()

    def parse(self, response):
        hxs = Selector(response)
        titles = hxs.xpath("//span[@itemprop='name']")
        items = []
        item = SiteScraperItem()
        item["Classification"] = titles.xpath("//div[@class='productSoldMessage']/text()").extract()[1:]
        item["Price"] = titles.xpath("//span[@class='pReg']/text()").extract()
        item["Name"] = titles.xpath("//span[@itemprop='name']/text()").extract()
        try:
            titles.xpath("//link[@itemprop='availability']/@href").extract()[0] == 'http://schema.org/InStock'
            item["Availability"] = 'In Stock'
        except:
            item["Availability"] = 'Out of Stock'
        if len(item["Name"]) == 0:
            item["OnlineStatus"] = 'Offline'
            item["Availability"] = ''
        else:
            item["OnlineStatus"] = 'Online'
        items.append(item)
        return items
I am exporting this data to CSV using the below command line code and would like the URL to be an additional value in my CSV file.
scrapy crawl product -o items.csv -t csv
Thanks in advance for your help!
Solution
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow