First, you have to find out how the autoscrolling script works. The easiest way to do this is not to reverse-engineer the JavaScript but to look at the network activity, for example with the Firebug plugin for Firefox and its "Net" panel. You quickly see that the website is organized in pages:
unsplash.com/page/1
unsplash.com/page/2
unsplash.com/page/3
...
When you scroll, the script requests the next page.
So we can write a script that downloads all the pages, parses their HTML for the images, and downloads them. If you look at the HTML code, you see that every image appears in the same, easily matched form:
<a href="http://bit.ly/14nUvzx"><img src="http://31.media.tumblr.com/2ba914db5ce556ee7371e354b438133d/tumblr_mq7bnogm3e1st5lhmo1_1280.jpg" alt="Download / By Tony Naccarato" title="http://unsplash.com/post/55904517579/download-by-tony-naccarato" class="photo_img" /></a>
The <a href attribute contains the URL of the full-resolution image. The title attribute contains a unique URL that also leads to the image. We will use it to construct a unique name for the image, much nicer than the one under which it is stored. This unique name also ensures that no image is downloaded twice.
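As a small offline sketch of that naming step, the snippet below runs the sample line shown above through two helper functions. The function names extract_pair and post_to_target are made up for this illustration; the sed expressions are the same ones the full script uses.

```shell
#!/bin/sh
# Hypothetical helper: print "IMAGE_URL POST_URL" for each matching HTML line on stdin.
extract_pair() {
    sed -n 's/^.*<a href="\([^"]*\)".*title="\([^"]*\)".*$/\1 \2/p'
}

# Hypothetical helper: turn the post URL from the title attribute into a local file name.
post_to_target() {
    # Keep everything after "post/", then replace the remaining slashes with underscores.
    name=$(printf '%s\n' "$1" | sed -e 's|.*post/\(.*\)$|\1|' -e 's|/|_|g')
    printf 'imgs/%s.jpg\n' "$name"
}

# The sample line from the page source quoted above.
LINE='<a href="http://bit.ly/14nUvzx"><img src="http://31.media.tumblr.com/2ba914db5ce556ee7371e354b438133d/tumblr_mq7bnogm3e1st5lhmo1_1280.jpg" alt="Download / By Tony Naccarato" title="http://unsplash.com/post/55904517579/download-by-tony-naccarato" class="photo_img" /></a>'

printf '%s\n' "$LINE" | extract_pair | while read -r IMG POST ; do
    echo "$IMG -> $(post_to_target "$POST")"
done
```

For the sample line this should print http://bit.ly/14nUvzx -> imgs/55904517579_download-by-tony-naccarato.jpg, so the file name comes from the post URL rather than from the shortened download link.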
Shell script (unsplash.sh)
#!/bin/bash
mkdir -p imgs # -p: no error when the directory already exists on later runs
I=1
while true ; do # for all the pages
    wget "unsplash.com/page/$I" -O tmppage
    grep '<a href.*<img src.*title' tmppage > tmppage.imgs
    if [ ! -s tmppage.imgs ] ; then # empty page - end the loop
        break
    fi
    echo "Reading page $I:"
    sed 's/^.*<a href="\([^"]*\)".*title="\([^"]*\)".*$/\1 \2/' tmppage.imgs | while read -r IMG POST ; do
        # for all the images on page
        TARGET=imgs/$(echo "$POST" | sed 's|.*post/\(.*\)$|\1|' | sed 's|/|_|g').jpg
        echo -n "Photo $TARGET: "
        if [ -f "$TARGET" ] ; then # we already have this image
            echo "already have"
            continue
        fi
        echo "downloading"
        # remove the file if the download fails, so it is retried on the next run
        wget "$IMG" -O "$TARGET" || rm -f "$TARGET"
    done
    I=$((I+1))
done
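The loop stops at the first page that has no image links. As a rough way to try that stop condition without touching the network, the sketch below runs the same grep pattern against two made-up files. The function name page_has_images and both sample lines are invented for this illustration and are not taken from the site.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the saved page contains at least one image link line.
page_has_images() {
    grep -q '<a href.*<img src.*title' "$1"
}

# Made-up sample pages, for illustration only.
tmpdir=$(mktemp -d)
printf '%s\n' '<a href="http://bit.ly/14nUvzx"><img src="x.jpg" title="http://unsplash.com/post/1/a" /></a>' > "$tmpdir/full"
printf '%s\n' '<html><body>no photos here</body></html>' > "$tmpdir/empty"

for f in full empty ; do
    if page_has_images "$tmpdir/$f" ; then
        echo "$f: has images, keep going"
    else
        echo "$f: no images, stop"
    fi
done
rm -rf "$tmpdir"
```

The full script gets the same effect by writing the grep output to tmppage.imgs and testing it with [ ! -s tmppage.imgs ], which is true when the file is empty.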
To make sure this runs every day, create a wrapper script unsplash.cron:
#!/bin/bash
export PATH=... # might not be needed, but sometimes the PATH is not set
                # correctly in cron-called scripts. Copy the PATH setting you
                # normally see under console.
cd YOUR_DIRECTORY || exit 1 # the directory where the script and the imgs directory are located
{
    echo "========================"
    echo -n "run unsplash.sh from cron "
    date
    ./unsplash.sh
} >> OUT.log 2>> ERR.log
Then add this line to your crontab (after issuing crontab -e on the console):
10 3 * * * PATH_to_the/unsplash.cron
The five fields are minute, hour, day of month, month, and day of week, so this runs the script every day at 3:10 in the morning.