Scrapy Python Craigslist Scraper

https://stackoverflow.com/questions/15456577

24-03-2022

Question

I am trying to scrape Craigslist classifieds using Scrapy to extract items that are for sale.

I am able to extract the date, post title, and post URL, but I am having trouble extracting the price.

For some reason, the current code puts all of the prices on the page into every item, but when I remove the // before the price span lookup, the price field comes back empty.

Can someone please review the code below and help me out?

    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    from craigslist_sample.items import CraigslistSampleItem

    class MySpider(BaseSpider):
        name = "craig"
        allowed_domains = ["craigslist.org"]
        start_urls = ["http://longisland.craigslist.org/search/sss?sort=date&query=raptor%20660&srchType=T"]

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            titles = hxs.select("//p")
            items = []
            for titles in titles:
                item = CraigslistSampleItem()
                item["date"] = titles.select('span[@class="itemdate"]/text()').extract()
                item["title"] = titles.select("a/text()").extract()
                item["link"] = titles.select("a/@href").extract()
                item["price"] = titles.select('//span[@class="itempp"]/text()').extract()
                items.append(item)
            return items

Solution

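The leading // makes the expression search the whole document rather than the current p element, which is why every item receives all of the prices. Without it, span[@class="itempp"] matches only direct children of p, and itempp appears to be nested inside another element, itempnr, so nothing matches. Changing //span[@class="itempp"]/text() to span[@class="itempnr"]/span[@class="itempp"]/text() should work. Another option is .//span[@class="itempp"]/text() (note the leading dot), which searches all descendants of the current p.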
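
To make the difference between a document-wide and a row-relative lookup concrete, here is a minimal sketch. It is not Scrapy code: it uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath (no text() step, and no absolute paths on an element), so the document-wide case is run from the root and .text stands in for /text(). The sample markup is hypothetical, modeled only on the class names mentioned in the question; the real page structure may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup (an assumption): each row is a <p>, and the price span
# "itempp" sits inside a wrapper span "itempnr", as the answer suggests.
SAMPLE = """
<div>
  <p>
    <span class="itemdate">Mar 16</span>
    <a href="/mcy/1.html">Raptor 660 (first)</a>
    <span class="itempnr"><span class="itempp">$3500</span></span>
  </p>
  <p>
    <span class="itemdate">Mar 15</span>
    <a href="/mcy/2.html">Raptor 660 (second)</a>
    <span class="itempnr"><span class="itempp">$2800</span></span>
  </p>
</div>
"""

root = ET.fromstring(SAMPLE)
rows = root.findall(".//p")

# Document-wide search: roughly what '//span[@class="itempp"]' does in every
# loop iteration, regardless of which row it is called on.
all_prices = [s.text for s in root.findall(".//span[@class='itempp']")]

# Direct-child search from each row: itempp is one level deeper, so no match.
direct_child = [
    [s.text for s in p.findall("span[@class='itempp']")] for p in rows
]

# Explicit path through the wrapper span, as proposed in the answer.
explicit_path = [
    [s.text for s in p.findall("span[@class='itempnr']/span[@class='itempp']")]
    for p in rows
]

# Descendant search anchored at the current row (leading dot).
descendant = [
    [s.text for s in p.findall(".//span[@class='itempp']")] for p in rows
]

print(all_prices)
print(direct_child)
print(explicit_path)
print(descendant)
```

With this sample, the document-wide search yields both prices at once, the direct-child search yields an empty list per row, and the two row-relative forms yield exactly one price per row. In the spider, the same row-relative expressions would be passed to titles.select(...) inside the loop.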

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow