Will ghost.py allow my users to scrape javascript injected images?

https://stackoverflow.com/questions/19062452

29-06-2022
Question

My site, http://whatgoeswiththis.co, has a scraper that takes images from the web and posts them to our site. I can get server-rendered images with no problem, but for sites like https://www.everlane.com/collections/mens-luxury-tees/products/mens-crew-antique, the images are rendered client-side with JavaScript.

I've succeeded in writing a script on my local machine that uses ghost.py to scrape the images from this site.

However, I've had to install various programs on my laptop, like Qt, PySide, PyQt4, and XQuartz. To my knowledge, these aren't libraries I can just add to my app. My question is: can this stack be added to my existing Django app so that users can scrape these JavaScript-injected images? Or is there another solution that web apps typically use?

Sites like http://wanelo.com are able to scrape these images - is there something in particular they're using that is an optimal solution?

Thanks for your help, and I apologize if I sound inexperienced (I am, but I'm learning!).

Solution

My current answer is that ghost.py may work, but only after setting up a lot of prerequisites that I found difficult to install and configure. My solution was to follow the advice of Pyklar and use PhantomJS through the selenium library, as described here: https://stackoverflow.com/a/15699761/2532070.

I was able to switch from BeautifulSoup to selenium/PhantomJS by changing a few lines of code and running brew install phantomjs and pip install selenium.
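For anyone wondering what that switch might look like, here is a rough sketch rather than my exact code. It assumes an older selenium release that still ships webdriver.PhantomJS (the PhantomJS driver was deprecated and is absent from Selenium 4), and the names render_page and extract_image_urls are made up for illustration. The idea is to let the headless browser run the page's JavaScript, then read the img tags from the rendered HTML.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImgSrcParser(HTMLParser):
    """Collects the src attribute of every <img> tag it sees."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:  # skip <img> tags without a usable src
                self.sources.append(src)


def extract_image_urls(html, base_url):
    """Return absolute image URLs found in an HTML string (hypothetical helper)."""
    parser = ImgSrcParser()
    parser.feed(html)
    # Resolve relative paths such as "/img/a.png" against the page URL.
    return [urljoin(base_url, src) for src in parser.sources]


def render_page(url):
    """Load url in PhantomJS and return the HTML after scripts have run.

    Assumption: selenium < 4 and a phantomjs binary on PATH.
    """
    from selenium import webdriver  # third-party: pip install selenium

    driver = webdriver.PhantomJS()
    try:
        driver.get(url)
        # Images injected well after the load event may need an explicit wait.
        return driver.page_source
    finally:
        driver.quit()  # always shut down the headless browser process
```

Calling extract_image_urls(render_page(page_url), page_url) would then give the list of image URLs for a page. The parsing step uses only the standard library, so it can be tried on any HTML string without a browser installed.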

I hope this helps someone avoid the same struggle!

OTHER TIPS

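You can do something like this with ghost.py, where url and your_image_css_selector are placeholders for your own values:

from ghost import Ghost

g = Ghost()
g.open(url, wait=False)  # start loading without waiting for the page to finish
# block until an element matching the CSS selector is present in the page
page, resources = g.wait_for_selector(your_image_css_selector)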

Licensed under: CC-BY-SA with attribution