Question

I have found a website that has a lot of high-quality free images hosted on Tumblr (it says you can do whatever you want with these images :P).

I am running Ubuntu 12.04 LTS. I need to write a script that runs periodically (say, daily) and downloads only the images that were not downloaded earlier.

Additional note: the page has a JavaScript auto-scroller, and the images get loaded when you scroll to the bottom of the page.


Solution 2

The fantastic original script by TMS no longer works with the new Unsplash website. Here is an updated, working version.

#!/bin/bash
mkdir -p imgs
I=1
while true ; do # for all the pages
        wget "https://unsplash.com/grid?page=$I" -O tmppage

        # extract the image URLs (hosted on unsplash.imgix.net) and strip their query strings
        grep 'img.*src.*unsplash.imgix.net' tmppage | cut -d'?' -f1 | cut -d'"' -f2 > tmppage.imgs

        if [ ! -s tmppage.imgs ] ; then # empty page - end the loop
                break
        fi

        echo "Reading page $I:"
        while read -r IMG; do

                # for all the images on the page
                TARGET=imgs/$(basename "$IMG")

                echo -n "Photo $TARGET: "
                if [ -f "$TARGET" ] ; then # we already have this image
                        echo "file already exists"
                        continue
                fi
                echo -n "downloading (PAGE $I)"

                wget "$IMG" -O "$TARGET"
        done < tmppage.imgs
        I=$((I+1))
done

Other tips

First, you have to find out how the auto-scrolling script works. The easiest way to do this is not to reverse-engineer the JavaScript, but to look at the network activity: open the Firebug Firefox plugin and watch the "Net" panel. You will quickly see that the website is organized in pages:

unsplash.com/page/1
unsplash.com/page/2
unsplash.com/page/3
...

When you scroll, the script requests the succeeding pages.
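A quick way to confirm this is to fetch one of those pages directly and count the image links it contains (a minimal check; as the updated answer above notes, the old unsplash.com/page/N layout may no longer exist):

wget -q "http://unsplash.com/page/2" -O page2.html
grep -c '<a href.*<img src.*title' page2.html   # number of photos found on that page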

So we can write a script that downloads all the pages, parses their HTML for all the images, and downloads those. If you look at the HTML code, you will see that each image appears in a nice and unique form:

<a href="http://bit.ly/14nUvzx"><img src="http://31.media.tumblr.com/2ba914db5ce556ee7371e354b438133d/tumblr_mq7bnogm3e1st5lhmo1_1280.jpg" alt="Download &nbsp;/ &nbsp;By Tony&nbsp;Naccarato" title="http://unsplash.com/post/55904517579/download-by-tony-naccarato" class="photo_img" /></a>

The <a href attribute contains the URL of the full-resolution image. The title attribute contains a nice unique URL that also leads to the image; we will use it to construct a unique name for the image, much nicer than the name under which it is stored. This unique name also ensures that no image is downloaded twice.
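For example, applying the same extraction that the script below uses to the sample line above yields the image URL, the post URL, and the resulting file name (a small sketch, run by hand):

LINE='<a href="http://bit.ly/14nUvzx"><img src="http://31.media.tumblr.com/2ba914db5ce556ee7371e354b438133d/tumblr_mq7bnogm3e1st5lhmo1_1280.jpg" alt="Download &nbsp;/ &nbsp;By Tony&nbsp;Naccarato" title="http://unsplash.com/post/55904517579/download-by-tony-naccarato" class="photo_img" /></a>'

echo "$LINE" | sed 's/^.*<a href="\([^"]*\)".*title="\([^"]*\)".*$/\1 \2/'
# -> http://bit.ly/14nUvzx http://unsplash.com/post/55904517579/download-by-tony-naccarato

echo "http://unsplash.com/post/55904517579/download-by-tony-naccarato" | sed 's|.*post/\(.*\)$|\1|' | sed 's|/|_|g'
# -> 55904517579_download-by-tony-naccarato   (saved as imgs/55904517579_download-by-tony-naccarato.jpg)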

Shell script (unsplash.sh)

mkdir -p imgs   # -p: do not fail if the directory already exists (the script runs daily)
I=1
while true ; do # for all the pages
        wget "unsplash.com/page/$I" -O tmppage
        grep '<a href.*<img src.*title' tmppage > tmppage.imgs
        if [ ! -s tmppage.imgs ] ; then # empty page - end the loop
                break
        fi
        echo "Reading page $I:"
        sed 's/^.*<a href="\([^"]*\)".*title="\([^"]*\)".*$/\1 \2/' tmppage.imgs | while read -r IMG POST ; do
                # for all the images on the page
                TARGET=imgs/$(echo "$POST" | sed 's|.*post/\(.*\)$|\1|' | sed 's|/|_|g').jpg
                echo -n "Photo $TARGET: "
                if [ -f "$TARGET" ] ; then # we already have this image
                        echo "already have"
                        continue
                fi
                echo "downloading"
                wget "$IMG" -O "$TARGET"
        done
        I=$((I+1))
done
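To try it once by hand (assuming the script was saved as unsplash.sh, as named above):

chmod +x unsplash.sh   # make it executable; the cron wrapper below calls ./unsplash.sh
./unsplash.sh          # downloaded photos end up in the imgs/ directory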

To make sure this runs every day, create a wrapper script unsplash.cron:

#!/bin/bash

export PATH=... # might not be needed, but sometimes the PATH is not set
                # correctly in cron-run scripts. Copy the PATH setting you
                # normally see in an interactive console.

cd YOUR_DIRECTORY # the directory where the script and imgs directory is located

{
echo "========================"
echo -n "run unsplash.sh from cron "
date

./unsplash.sh 

} >> OUT.log 2>> ERR.log

Then add this line in your crontab (after issuing crontab -e on the console):

10 3 * * * PATH_to_the/unsplash.cron

This will run the script every day at 3:10.
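For reference, the five leading fields of that line are minute, hour, day of month, month and day of week; also make sure the wrapper itself is executable, since cron calls it directly:

chmod +x PATH_to_the/unsplash.cron

# 10 3 * * *   PATH_to_the/unsplash.cron
# |  | | | |
# |  | | | +-- day of week (any)
# |  | | +---- month (any)
# |  | +------ day of month (any)
# |  +-------- hour (3)
# +----------- minute (10)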

Here's a small Python (Python 2) version of the download part. The getImageURLs function fetches the data from http://unsplash.com/page/X, looks for lines that contain the word 'Download', and extracts the image 'src' attribute from them. It also looks for the strings current_page and total_pages (which are present in the JavaScript code) to find out how long to keep going.

Currently, it first retrieves all the URLs from all the pages, and then downloads each image whose corresponding file does not exist locally. Depending on how the page numbering changes over time, it may be somewhat more efficient to stop looking for image URLs as soon as a local copy of a file has been found. The files are stored in the directory in which the script was executed.

The other answer explains very well how to make sure something like this can get executed daily.

#!/usr/bin/env python

import urllib
import os

def getImageURLs(pageIndex):
    # fetch one listing page and scrape it for image URLs and pagination info
    f = urllib.urlopen('http://unsplash.com/page/' + str(pageIndex))
    data = f.read()
    f.close()

    curPage = None
    numPages = None
    imgUrls = [ ]

    for l in data.splitlines():
        # image lines contain a 'Download' link; grab the src="..." attribute value
        if 'Download' in l and 'src=' in l:
            idx = l.find('src="')
            if idx >= 0:
                idx2 = l.find('"', idx+5)
                if idx2 >= 0:
                    imgUrls.append(l[idx+5:idx2])

        # the page's javascript exposes the pagination counters
        elif 'current_page = ' in l:
            idx = l.find('=')
            idx2 = l.find(';', idx)
            curPage = int(l[idx+1:idx2].strip())
        elif 'total_pages = ' in l:
            idx = l.find('=')
            idx2 = l.find(';', idx)
            numPages = int(l[idx+1:idx2].strip())

    return (curPage, numPages, imgUrls)

def retrieveAndSaveFile(fileName, url):
    # download the image and write the raw bytes to a local file
    f = urllib.urlopen(url)
    data = f.read()
    f.close()

    g = open(fileName, "wb")
    g.write(data)
    g.close()

if __name__ == "__main__":

    allImages = [ ]
    done = False
    page = 1
    # first pass: collect the image URLs from every page
    while not done:
        print "Retrieving URLs on page", page
        res = getImageURLs(page)
        allImages += res[2]

        if res[0] >= res[1]:
            done = True
        else:
            page += 1

    # second pass: download every image we do not have locally yet
    for url in allImages:
        idx = url.rfind('/')
        fileName = url[idx+1:]
        if not os.path.exists(fileName):
            print "File", fileName, "not found locally, downloading from", url
            retrieveAndSaveFile(fileName, url)

    print "Done."
Licensed under: CC-BY-SA with attribution