Question

I have a very simple code sample taken from https://github.com/sylvinus/node-crawler:

var Crawler = require("crawler").Crawler;

var c = new Crawler({
    "maxConnections":10,
    "callback":function(error,result,$) {
        console.log(result.body);
    }
});

c.queue("http://google.com");

The output was

<Buffer 3c 21 64 6f 63 74 79 70 65 20 68 74 6d 6c 3e 3c 68 74 6d 6c 20 69 74 65
6d 73 63 6f 70 65 3d 22 69 74 65 6d 73 63 6f 70 65 22 20 69 74 65 6d 74 79 70 65
 ...>

If I change the URL to, say, c.queue("http://bing.com"), I get the HTML text. The behavior varies from website to website, and I don't understand why.

If I change console.log(result.body); to console.log($('body').text());, google.com gives me some weird text that doesn't look like what is sent to my browser.

If I change the URL to some other website, such as Pinterest.com, I start getting an error in jsdom.js itself:

C:\node_modules\crawler\node_modules\crawler\node_modules\jsdom\lib\jsdom.js:256

        window.document.documentElement.appendChild(script);
                                        ^
TypeError: Cannot call method 'appendChild' of null
    at exports.env.exports.jsdom.env.processHTML (C:\node_modules\crawler\node_modules\crawler\node_modules\jsdom\lib\jsdom.js:256:41)
    at Array.forEach (native)
    at exports.env.exports.jsdom.env.processHTML (C:\node_modules\crawler\node_modules\crawler\node_modules\jsdom\lib\jsdom.js:239:18)
    at Object.exports.env.exports.jsdom.env (C:\node_modules\crawler\node_modules\crawler\node_modules\jsdom\lib\jsdom.js:268:5)
    at exports.Crawler.self.onContent.jsd (C:\node_modules\crawler\node_modules\crawler\lib\crawler.js:243:37)
    at exports.Crawler.self.onContent (C:\node_modules\crawler\node_modules\crawler\lib\crawler.js:278:29)
    at fs.readFile (fs.js:176:14)
    at Object.oncomplete (fs.js:297:15)

So I guess this is an error in the code itself, but if not, can someone help point out my mistake?

Note: you may need to add the jQueryUrl parameter to new Crawler({}) so that it points to the location of your jQuery file.


Solution

Have a look at Cheerio if you haven't already. It might give you more consistent results. I used it for a crawler a while back and didn't have issues like the ones you describe.

I chose it because it seemed to have a cleaner design.
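
As a side note on the <Buffer ...> output in the question: that is how console.log prints a raw Node.js Buffer, so the body may only need decoding before it is used as text. Below is a minimal sketch, not a fix inside the crawler library. It assumes the page is UTF-8 (the question doesn't say) and a Node.js version that provides Buffer.from; bodyToText is a hypothetical helper name, and the sample bytes are the first 15 from the output shown in the question.

```javascript
// Hypothetical helper: return text whether the body arrives as a Buffer or a string.
// Assumes UTF-8; a page served in another charset would need a different decoder.
function bodyToText(body) {
    return Buffer.isBuffer(body) ? body.toString("utf8") : String(body);
}

// First 15 bytes of the <Buffer ...> output shown in the question.
var sample = Buffer.from("3c21646f63747970652068746d6c3e", "hex");

console.log(bodyToText(sample)); // <!doctype html>
```

Inside the crawler callback, the same helper could be applied as console.log(bodyToText(result.body)), which would print text for both the google.com and bing.com cases.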

Licensed under: CC-BY-SA with attribution