I'm relatively new to CasperJS. I have written simple scraping scripts, and now I'm facing a more difficult task: I want to scrape data from a list of URLs, but some pages sometimes "fail". I use a captcha-solving service because a few of these pages show a captcha by default, but PhantomJS is rather inconsistent in rendering some captchas: sometimes they load, sometimes they don't.
The solution I thought of was to rerun the script on the pages that failed to load the captcha, in order to get the amount of data I need. But I can't seem to get it running. I thought of wrapping the whole thing in a function, invoking it inside the casper.run() method, and checking whether the amount of data scraped meets the minimum I need; if not, rerun. But I don't really know how to accomplish this since, from what I've seen, CasperJS adds the steps to the stack before calling the function (correct me if I'm wrong). I'm also thinking of something using the run.complete event, but I'm not sure how to do it. My script is something like this:
// This variable stores the amount of data collected
var pCount = 0;
var urls = ["http://page1.com", "http://page2.com"];
// Create casperjs instance...
casper.start();
casper.eachThen(urls, function(response) {
    // Only open the next URL while the requested amount has not been reached
    if (pCount < casper.cli.options.number) {
        casper.thenOpen(response.data, function(response) {
            // Here is where the magic goes on
        });
    }
});
casper.run();
Is there any way I can wrap the casper.eachThen() block in a function and do something like this?
casper.start();
function sample() {
    casper.eachThen(urls, function(response) {
        if (pCount < casper.cli.options.number) {
            casper.thenOpen(response.data, function(response) {
                // Here is where the magic goes on
            });
        }
    });
}
casper.run(sample);
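To make the retry idea concrete, here is a minimal plain-JavaScript sketch of the control flow I'm after, with no CasperJS involved. Everything in it is an assumption made for illustration: `makeFlakyScraper` is a hypothetical stand-in for a page whose captcha sometimes fails to render, and `scrapeWithRetries`, `minimum` and `maxPasses` are names I made up. It only shows the "retry just the failed URLs until I have enough data" logic, not how to schedule it as CasperJS steps.

```javascript
// Hypothetical stand-in for one scraping attempt; NOT a CasperJS API.
// Each URL returns null (captcha did not render) for its first
// `failuresPerUrl` attempts, then returns data.
function makeFlakyScraper(failuresPerUrl) {
    var attempts = {};
    return function (url) {
        attempts[url] = (attempts[url] || 0) + 1;
        return attempts[url] > failuresPerUrl ? "data from " + url : null;
    };
}

// Retry only the URLs that failed, until `minimum` items have been
// collected or `maxPasses` passes over the pending list have been made.
function scrapeWithRetries(urls, scrape, minimum, maxPasses) {
    var results = [];
    var pending = urls.slice();
    for (var pass = 0;
         pass < maxPasses && pending.length > 0 && results.length < minimum;
         pass++) {
        var failed = [];
        pending.forEach(function (url) {
            if (results.length >= minimum) {
                return; // already have enough data, skip the rest
            }
            var data = scrape(url);
            if (data === null) {
                failed.push(url); // try this one again on the next pass
            } else {
                results.push(data);
            }
        });
        pending = failed;
    }
    return results;
}

var urls = ["http://page1.com", "http://page2.com"];
var collected = scrapeWithRetries(urls, makeFlakyScraper(1), 2, 3);
console.log(collected.length);
```

In the real script the `scrape` call would be a casper.thenOpen() step and `minimum` would come from casper.cli.options.number; what I can't work out is how to get CasperJS to queue the second pass after the first one has finished.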
Also, I tried using SlimerJS as the engine to avoid the "inconsistencies", but I couldn't get the __utils__.sendAJAX() method to work inside a casper.evaluate() call I have, so it's a deal-breaker. Or is there a way to do a GET request asynchronously in a separate instance? If so, I would appreciate your advice.
Update: I never managed to solve it with CasperJS. Nonetheless, I found a workaround for my particular use case; check my answer for more info.