Question

What is the safest way to extract item information from pages? Sometimes an item is missing from the page, and the crawler breaks as a result.

Consider this example:

    for cotacao in tabela_cotacoes:
        citem = CotacaoItem()
        citem['name'] = cotacao.select("td[4]/text()").extract()[0]
        citem['symbol'] = cotacao.select("td/a/b/text()").extract()[0]
        citem['current'] = cotacao.select("td[6]/text()").extract()[0]
        citem['last_neg'] = cotacao.select("td[7]/text()").extract()[0]
        citem['oscillation'] = cotacao.select("td[8]/text()").extract()[0]
        citem['openning'] = cotacao.select("td[9]/text()").extract()[0]
        citem['close'] = cotacao.select("td[10]/text()").extract()[0]
        citem['maximum'] = cotacao.select("td[11]/text()").extract()[0]
        citem['minimun'] = cotacao.select("td[12]/text()").extract()[0]
        citem['volume'] = cotacao.select("td[13]/text()").extract()[0]

If an item is missing from the page, .extract() returns [] and indexing it with [0] raises an IndexError (list index out of range).

So the question is: what is the best approach to deal with this?


Solution

Write a little helper function:

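    def extractor(xpathselector, selector):
        """
        Helper function that extracts info from an xpathselector object
        using the selector constraints. Returns None if nothing matches.
        """
        val = xpathselector.select(selector).extract()
        # An empty list is falsy, so a missing node yields None instead of an IndexError
        return val[0] if val else None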

And use it like this:

    citem['name'] = extractor(cotacao, "td[4]/text()")

Return whatever value best indicates that a field wasn't found. In my code I return None; change it if necessary (for example, return '' if that makes more sense for your data).
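
If you want to choose the fallback per call, one possible variation is to add a default parameter and drive the extraction from a field-to-XPath mapping. This is only a sketch: FakeRow and FakeResult are hypothetical stand-ins I made up so the snippet runs without Scrapy, and FIELDS and build_item are illustrative names, not part of the original code.

```python
class FakeResult:
    """Hypothetical stand-in for the object returned by .select()."""

    def __init__(self, values):
        self._values = values

    def extract(self):
        # Like the real call, returns a (possibly empty) list of strings
        return list(self._values)


class FakeRow:
    """Hypothetical stand-in for an XPath selector over one table row."""

    def __init__(self, cells):
        self._cells = cells  # maps an XPath string to its matched strings

    def select(self, xpath):
        return FakeResult(self._cells.get(xpath, []))


def extractor(xpathselector, selector, default=None):
    """Return the first match for selector, or default if nothing matches."""
    val = xpathselector.select(selector).extract()
    return val[0] if val else default


# Illustrative subset of the fields from the question
FIELDS = {
    "name": "td[4]/text()",
    "symbol": "td/a/b/text()",
    "current": "td[6]/text()",
}


def build_item(row):
    """Fill every field, leaving None where the page has no value."""
    return {field: extractor(row, xpath) for field, xpath in FIELDS.items()}


# A row where the "current" cell is missing
row = FakeRow({"td[4]/text()": ["ACME"], "td/a/b/text()": ["ACM3"]})
item = build_item(row)
```

With a real crawler you would pass cotacao instead of FakeRow and assign into citem instead of a dict. More recent Scrapy releases also offer .extract_first() and .get() on selector results, which return None rather than raising when nothing matches, so the helper may not be needed there.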

Licensed under: CC-BY-SA with attribution