Question

I have a Ruby application that uses CasperJS to scrape sites, return the results, and do some manipulation of the data.

Right now, my CasperJS script consumes around 200 MB, which means that if I want to launch 5 instances in parallel, do the math :). I am thinking of deploying this to Heroku, but I am sure I would hit the memory limits for the worker.

What could I do to reduce memory usage or make this scalable? I want to be able to parse more than 10 pages in parallel. Should I look at an alternative (I really need a headless browser because I want to see where all the elements are on the page, not just scrape the HTML)?

Solution

I ran into the same situation, especially with photo-intensive sites like Pinterest. In that case I could only run a single session for about 30 minutes before PhantomJS would crash at around 1 GB of memory usage.

Casper has an option to not download images, which may save memory in PhantomJS. I'm assuming the memory builds up due to page caching. I would love to hear somebody else's opinion on the matter.
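For what it's worth, a minimal sketch of those options (example.com is a placeholder; getElementBounds is what gives you element positions, and it still works with images disabled):

    // Create a Casper instance that skips images and plugins.
    // pageSettings is forwarded to the underlying PhantomJS page,
    // so loadImages: false prevents image downloads entirely.
    var casper = require('casper').create({
        pageSettings: {
            loadImages: false,   // don't download images
            loadPlugins: false   // don't load Flash/Java plugins
        }
    });

    casper.start('http://example.com/', function () {
        // Element geometry is still available without the images:
        this.echo(JSON.stringify(this.getElementBounds('h1')));
    });

    casper.run();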

I don't think there are many alternatives out there. PhantomJS, with its limitations, is still far faster than Selenium.

OTHER TIPS

@Hommer Smith,

Think about putting Varnish in front of your scraper: cache the images there and take that load off CasperJS.

Set up CasperJS to go through your Varnish cache before loading external pages; that way it won't need as much memory.
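An untested sketch of that setup (I haven't run this either): PhantomJS accepts a --proxy flag, which CasperJS passes through, so assuming Varnish listens locally on its default port 6081 you would launch with something like casperjs --proxy=127.0.0.1:6081 scrape.js. The script itself needs no proxy-specific code:

    // scrape.js -- launched as: casperjs --proxy=127.0.0.1:6081 scrape.js
    // (address is hypothetical; adjust to wherever your Varnish listens).
    // Every request the headless browser makes then goes through Varnish,
    // so repeated assets (images, CSS, JS) are served from the cache.
    var casper = require('casper').create();

    casper.start('http://example.com/', function () {
        this.echo(this.getTitle());
    });

    casper.run();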

Just a tip; I haven't tested the solution yet.

Regards!

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow