Question

I need to scrape images and/or the URLs of the images from a website for a client. I have a list of URLs, each of which leads to a specific product page containing product images, a product description, and so on (as shown below).

My example URL list (using Firebox's website purely because it has a good example of my problem):

http://www.firebox.com/product/6078/Back-To-The-Future-iPad-Case?via=hp&s=1x1&t=livefeed
http://www.firebox.com/product/5773/Scientific-Spice-Rack?via=hp&s=1x1&t=random
http://www.firebox.com/product/6147/Rice-Cube?via=related

The problem I am having:

If you go to one of these URLs you can see that the website shows a number of product images.

If you then click one of the images, the site displays a larger version of it. It is these larger images (typically around 980 x 980), shown when you click to zoom, that I require, rather than the smaller images (typically around 250 x 250) that are displayed initially when you open the product page.


The client needs the larger images because his website automatically scales every image to 1200 x 1200, and smaller images look blurry/pixelated when scaled up to that size.

How would I go about getting these larger dimension images from the site rather than the smaller ones?

I have looked at possible ways to do this from Perl (as this is what I know), such as Selenium, but I cannot see how to actually get the larger images. This is a new topic to me.


Solution

The task comes down to extracting the link to the large image from the HTML source of the page. If you view the HTML source of http://www.firebox.com/product/5773/Scientific-Spice-Rack?via=hp&s=1x1&t=random you can see the links to the large images. Part of the HTML is shown below:

<img class="extra_thumb" data-item="0"   data-sku="sku14014"  data-zoom-image="http://media.firebox.com/pic/p5773_column_grid_12.jpg" data-caption="" data-image="http://media.firebox.com/pic/p5773_column_grid_6.jpg"  src="http://media.firebox.com/pic/p5773_column_grid_1.jpg"/>

The link to the large image is in the data-zoom-image attribute (data-image points to another version of the same picture), so all that remains is to extract it with a regex. You can use Perl, Python or many other languages for this. Here is a short Perl example:

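#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
use LWP::Simple;
use List::MoreUtils qw( uniq );

# Fetch the product page; get() returns undef on failure.
my $url     = 'http://www.firebox.com/product/5773/Scientific-Spice-Rack?via=hp&s=1x1&t=random';
my $content = get($url);
die "Couldn't fetch $url\n" unless defined $content;

# Capture the value of every data-zoom-image attribute (the large images).
my @big_images = $content =~ /data-zoom-image="([^"]+)"/g;

# The same image can be referenced more than once, so print each URL only once.
say for uniq @big_images;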

The script above prints:

http://media.firebox.com/pic/p5773_column_grid_12.jpg
http://media.firebox.com/pic/p5773_extra1_column_grid_12.jpg
http://media.firebox.com/pic/p5773_extra2_column_grid_12.jpg
http://media.firebox.com/pic/p5773_s14019_column_grid_12.jpg
http://media.firebox.com/pic/p5773_s14014_column_grid_12.jpg
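
Since Python is mentioned above as an alternative, here is a minimal sketch of the same extraction using only Python's standard library. It swaps the regex for html.parser, which is less sensitive to attribute order and quoting, and it assumes the pages still expose the large image in a data-zoom-image attribute on img tags, as in the HTML excerpt above. The names ZoomImageParser, extract_zoom_images and zoom_images_for are illustrative, not part of any library.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class ZoomImageParser(HTMLParser):
    """Collect data-zoom-image values from <img> tags, in page order."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        if tag != "img":
            return
        for name, value in attrs:
            # Skip empty values and URLs that were already collected.
            if name == "data-zoom-image" and value and value not in self.urls:
                self.urls.append(value)


def extract_zoom_images(html):
    """Return the unique large-image URLs found in an HTML string."""
    parser = ZoomImageParser()
    parser.feed(html)
    parser.close()
    return parser.urls


def zoom_images_for(page_url):
    """Fetch one product page and return its large-image URLs (needs network)."""
    with urlopen(page_url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return extract_zoom_images(html)


if __name__ == "__main__":
    # Offline demo on a shortened copy of the <img> tag quoted above.
    sample = (
        '<img class="extra_thumb" data-item="0" '
        'data-zoom-image="http://media.firebox.com/pic/p5773_column_grid_12.jpg" '
        'data-image="http://media.firebox.com/pic/p5773_column_grid_6.jpg" '
        'src="http://media.firebox.com/pic/p5773_column_grid_1.jpg"/>'
    )
    for url in extract_zoom_images(sample):
        print(url)
```

To process the whole list of product pages from the question, one could call zoom_images_for(url) for each URL in the list and collect the results. Whether the live site still serves this markup is not something the sketch can guarantee.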
Licensed under: CC-BY-SA with attribution