Question

I am extracting data from one page. I will have to go deeper eventually, of course, but I am still stuck on that first page. This is my code:

from scrapy.contrib.spiders import CrawlSpider
from scrapy.selector import HtmlXPathSelector
from street.items import HstreetItem

class MySpider(CrawlSpider):
    name = "go-h"
    allowed_domains = ["somedomain.com"]  # domain only, without the http:// scheme
    start_urls = ["http://somedomain.com"]

    def parse(self, response):
        #response = response.replace(body=response.body.replace('\n', '')) # doesn't work
        hxs = HtmlXPathSelector(response)
        details = hxs.select('//tr')
        items = []
        for detail in details:
            item = HstreetItem()
            item['url'] = "".join(detail.select('td[@class="Model_LineModel_odd"]/a/@href | td[@class="Model_LineModel_even"]/a/@href').extract()).strip()
            item['model'] = "".join(detail.select('td[@class="Model_LineModel_odd"]/a/text() | td[@class="Model_LineModel_even"]/a/text()').extract())
            item['year'] = "".join(detail.select('td[@class="Model_LineYear_odd"]/text() | td[@class="Model_LineYear_even"]/text()').extract())
            items.append(item)
        return items

The code works fine, and it extracts data through my pipeline into a CSV file like it should:

cell 1 | cell 2 | cell 3
url    | model  | year
...

The problem is that I have a lot of empty lines in my CSV file: exactly 17 at the beginning, and then more empty lines scattered between the filled lines. I think the cause is a few tables that sit in front of the table I am crawling, plus some rows inside the crawled table that I don't need (like category-name rows). I have been stuck on this for the last 24 hours :( I have tried all the solutions I found in similar questions, but nothing worked for me.

Thanks for the help!

Was it helpful?

Solution

I am quite new to Python and landed here while trying to understand Scrapy.

From what I understand, you must be appending empty items: //tr matches every table row on the page, so rows that don't contain the model/year cells produce items whose fields are all empty strings. You might try checking that 'item' is not empty before the append statement, e.g.,

# keep the row only if at least one field actually extracted something
if not (item['url'] == "" and item['model'] == "" and item['year'] == ""):
    items.append(item)
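
Since empty strings are falsy in Python, the same guard can be written more compactly with any(); a minimal equivalent:

# append only if at least one of the three fields is non-empty
if any(item[field] for field in ('url', 'model', 'year')):
    items.append(item)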

Please ignore if I misunderstood the question.
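
If the class names in your snippet are representative (an assumption on my side), another option is to filter at the selector level, so rows from the other tables and the category-name rows never produce items in the first place:

# match only rows that contain one of the data cells;
# the Model_Line* class names are taken from the snippet in the question
details = hxs.select('//tr[td[starts-with(@class, "Model_Line")]]')

That way the append guard becomes unnecessary, because every selected row carries data.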

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow