Question

I'm trying to write a scraper using 'request' and 'cheerio'. I have an array of 100 URLs. I loop over the array, call 'request' on each URL, and then call cheerio.load(body). If I let the loop run for more than about 3 iterations (I limit it to i < 3 for testing), the scraper breaks with a TypeError because the value used to build productNumber is undefined and I can't call split on undefined. I think the for loop is moving on before the webpage responds and has time to load the body with cheerio, and this question seems to agree: nodeJS - Using a callback function with Cheerio.

My problem is that I don't understand how I can make sure the webpage has 'loaded' or been parsed in each iteration of the loop so that I don't get any undefined variables. According to the other answer I don't need a callback, but then how do I do it?

for (var i = 0; i < productLinks.length; i++) {
    productUrl = productLinks[i];
    request(productUrl, function(err, resp, body) {
        if (err)
            throw err;
        $ = cheerio.load(body);
        var imageUrl = $("#bigImage").attr('src'),
            productNumber = $("#product").attr('class').split(/\s+/)[3].split("_")[1]
        console.log(productNumber);

    });
};

Example of output:

1461536
1499543

TypeError: Cannot call method 'split' of undefined

Solution

You are scraping some external site(s). You can't be sure the HTML all fits exactly the same structure, so you need to be defensive on how you traverse it.

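var product = $('#product');
// A cheerio selection is always truthy, so check its length instead.
if (!product.length) return console.log('Cannot find a product element');
var productClass = product.attr('class');
if (!productClass) return console.log('Product element does not have a class defined');
// The fourth class name is expected to look like "prefix_1461536".
var classNames = productClass.split(/\s+/);
if (!classNames[3]) return console.log('Unexpected class list: ' + productClass);
var productNumber = classNames[3].split("_")[1];
console.log(productNumber);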

This'll help you debug where things are going wrong, and perhaps indicate that you can't scrape your dataset as easily as you'd hoped.
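The same defensive checks can also be gathered into one small function that works on the class string alone, which makes them easy to try out without fetching a page. This is only a sketch: the name extractProductNumber is hypothetical, and the sample class string is an assumption inferred from the question's code (fourth class name, number after the underscore).

```javascript
// Hypothetical helper: pulls the product number out of a class attribute string.
// Assumes the number sits in the fourth class name, after an underscore,
// e.g. "a b c item_1461536" (layout inferred from the question's code).
function extractProductNumber(classAttr) {
    if (typeof classAttr !== 'string') return null;   // attr('class') returned undefined
    var classNames = classAttr.trim().split(/\s+/);
    var target = classNames[3];
    if (!target) return null;                         // fewer than four class names
    var parts = target.split('_');
    // Return null when there is no "_<number>" suffix.
    return parts.length > 1 && parts[1] ? parts[1] : null;
}

console.log(extractProductNumber('a b c item_1461536')); // 1461536
console.log(extractProductNumber(undefined));            // null
```

In the scraper it would be called as extractProductNumber($('#product').attr('class')), and a null result tells you which pages do not match the expected structure.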

Other tips

Since you're not declaring a new $ variable inside the callback, $ is an implicit global that every callback shares, and it is overwritten each time a request completes. This can lead to hard-to-trace bugs: any code that reads $ after a later asynchronous step may see the document from a different response.

So try creating a new variable:

var $ = cheerio.load(body);
^^^ this is the important part

Also, you are correct in assuming that the loop continues before the request is completed (in your situation, it isn't cheerio.load that is asynchronous, but request is). That's how asynchronous I/O works.
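To see that ordering without hitting a real site, here is a small sketch. fakeRequest is a hypothetical stand-in for request (it just calls back on a timer), and the URLs are placeholders.

```javascript
// Hypothetical stand-in for request(): it calls back on a later tick,
// the way a real HTTP request would.
function fakeRequest(url, callback) {
    setTimeout(function () {
        callback(null, { statusCode: 200 }, '<html>' + url + '</html>');
    }, 10);
}

var events = [];
var urls = ['/a', '/b', '/c'];   // placeholder URLs

for (var i = 0; i < urls.length; i++) {
    events.push('started ' + urls[i]);
    fakeRequest(urls[i], function (err, resp, body) {
        // Runs only after the for loop has already finished.
        events.push('finished ' + body);
    });
}
events.push('loop done');

// Print the recorded order once all callbacks have had time to run.
setTimeout(function () { console.log(events); }, 50);
```

All three 'started' entries and 'loop done' are recorded before any 'finished' entry, because the loop only schedules the requests and the callbacks run later.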

To coordinate asynchronous operations you can use, for instance, the async module; in this case, async.eachSeries might be useful.
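A rough sketch of what processing the URLs one at a time looks like. So that it runs without installing anything, eachSeries below is a hand-written stand-in with the same calling shape as async.eachSeries(items, iteratee, done), not the library itself, and fakeRequest is a hypothetical replacement for request. The URLs are placeholders.

```javascript
// Minimal stand-in with the same calling shape as async.eachSeries.
// It starts the next item only after the previous one has called back.
function eachSeries(items, iteratee, done) {
    var index = 0;
    function next(err) {
        if (err || index >= items.length) return done(err || null);
        iteratee(items[index++], next);
    }
    next();
}

// Hypothetical stand-in for request(); swap in the real call in the scraper.
function fakeRequest(url, callback) {
    setTimeout(function () {
        callback(null, { statusCode: 200 }, '<html>' + url + '</html>');
    }, 10);
}

var productLinks = ['/p/1', '/p/2', '/p/3'];   // placeholder URLs
var bodies = [];

eachSeries(productLinks, function (productUrl, callback) {
    fakeRequest(productUrl, function (err, resp, body) {
        if (err) return callback(err);   // stop the series instead of throwing
        bodies.push(body);               // cheerio.load(body) would go here
        callback();                      // signal that the next URL may start
    });
}, function (err) {
    if (err) return console.error('Scrape failed:', err);
    console.log('Fetched ' + bodies.length + ' pages in order');
});
```

With the real async module installed, the call site should look the same with async.eachSeries in place of the local eachSeries. Running the requests in series is slower than firing them all at once, but it keeps the order predictable and avoids hitting the site with 100 requests simultaneously.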

License: CC-BY-SA with attribution
Not affiliated with StackOverflow