Question

I'm relatively new to CasperJS and have written simple scraping scripts, but now I'm facing a harder task: I want to scrape data from a list of URLs, and some pages sometimes "fail". I use a captcha-solving service because a few of these pages show a captcha by default, but PhantomJS is rather inconsistent in rendering some captchas: sometimes they load, sometimes they don't.

The solution I thought of was to rerun the script on the pages that failed to load the captcha, in order to get the amount of data I need. But I can't get it running. I thought of wrapping the whole thing in a function, invoking it inside the casper.run() method, and checking whether the amount of data scraped meets the minimum I need; if not, rerun. I don't really know how to accomplish it because, from what I've seen, CasperJS adds the steps to the stack before calling the function (correct me if I'm wrong). I'm also thinking of using the run.complete event, but I'm not sure how to do it. My script is something like this:

// This variable stores the amount of data collected
var pCount = 0;
var urls = ["http://page1.com", "http://page2.com"];
// Create casperjs instance...
casper.start();

casper.eachThen(urls, function(response) {
    if (pCount < casper.cli.options.number) {
        casper.thenOpen(response.data, function(response) {
        // Here is where the magic goes on
        })
    }
})
casper.run();

Is there any way I can wrap the casper.eachThen() block in a function and do something like this?

casper.start();
function sample () {
    casper.eachThen(urls, function(response) {
        if (pCount < casper.cli.options.number) {
            casper.thenOpen(response.data, function(response) {
            // Here is where the magic goes on
            })
        }
    })
}
casper.run(sample);

Also, I tried using SlimerJS as the engine to avoid the "inconsistencies", but I couldn't get the __utils__.sendAjax() method to work inside a casper.evaluate() call I have, so that's a deal-breaker. Or is there a way to do a GET request asynchronously in a separate instance? If so, I would appreciate your advice.

Update: I never managed to solve this with CasperJS. I did, however, find a workaround for my particular use case; see my answer for more info.

Solution 2

I never found a way to do this from within CasperJS. This is how I solved it:

There's a program A that manages user input (in my case, written in C#). Program A is the one that executes the CasperJS script and reads its output. If I need to rerun the script, I just output a message with some specifications so that program A can catch it.

It may not be the best way, but it worked for me. Hope it helps.
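
The control loop of such a program A could look roughly like the sketch below. It is written in plain JavaScript rather than C#, and several things in it are assumptions that do not come from the answer: the "RETRY <url>" output line is a made-up convention for the rerun message, and runScript is a hypothetical function that launches the CasperJS script for the given URLs and returns everything it printed.

```javascript
// Collect the URLs that the script asked to have retried.
// Assumed convention: one "RETRY <url>" line per failed page.
function parseRetries(output) {
    var retries = [];
    output.split(/\r?\n/).forEach(function (line) {
        var match = /^RETRY\s+(\S+)/.exec(line);
        if (match) {
            retries.push(match[1]);
        }
    });
    return retries;
}

// Run the script, then rerun it with only the failed URLs, until nothing
// is left to retry or maxRuns is reached.
// runScript(urls) is hypothetical: it must start the scraper for `urls`
// and return its standard output as a string.
function runUntilDone(runScript, urls, maxRuns) {
    var pending = urls.slice();
    var runs = 0;
    while (pending.length > 0 && runs < maxRuns) {
        pending = parseRetries(runScript(pending));
        runs++;
    }
    return { runs: runs, failed: pending };
}
```

Under Node.js, runScript could be built on child_process.spawnSync, which blocks until the child process exits and exposes its stdout. How the URLs are handed to the script (for example through a command-line option read from casper.cli.options) is left open here. The maxRuns cap is there so that a page which never renders its captcha cannot keep the loop going forever.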

Other tips

Maybe with the back function, something like this:

var casper = require('casper').create();
// Declared outside the step so the value is not reset on every attempt
var count = 0;

casper.start()
.thenOpen('your url')
.then(function(){
    if (this.exists("selector containing the captcha")){
        // continue the script
    }
    else if (count==3){
        this.echo("in 3 attempts, it failed each time");
        this.exit();
    }
    else{
        count++;
        // go back one step in the browser history, with the aim of re-opening the url
        casper.back();
    }
})
.run();
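
A different way to retry, sketched below, is to schedule the open step again from inside its own callback instead of going back in history. It relies on CasperJS queuing steps that are added while another step is running right after the current one. The helper name openWithRetry, its parameters and the two callbacks are all hypothetical; casper is passed in as an argument so the sketch assumes nothing about how the instance was created.

```javascript
// Open `url` and check for `selector`; if it is missing, queue the same
// open step again, up to maxAttempts times in total.
// onReady / onGiveUp are placeholder callbacks: (url, attemptsUsed).
function openWithRetry(casper, url, selector, maxAttempts, onReady, onGiveUp) {
    var attempt = 0;

    function tryOnce() {
        attempt++;
        casper.thenOpen(url, function () {
            if (casper.exists(selector)) {
                // The captcha rendered: carry on with the real work
                onReady.call(casper, url, attempt);
            } else if (attempt < maxAttempts) {
                // Not rendered: add another attempt to the step queue
                tryOnce();
            } else {
                onGiveUp.call(casper, url, attempt);
            }
        });
    }

    tryOnce();
}
```

In a script it would be called after casper.start() and before casper.run(), once per URL. Because the attempt counter lives in the closure, each URL gets its own retry budget.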
Licensed under: CC-BY-SA with attribution