Question

After fixing a website's code to use a CDN (rewriting all the URLs to images, JS and CSS), I need to test all the pages on the domain to make sure all the resources are fetched from the CDN.

All the site's pages are accessible through links; there are no isolated pages.

Currently I'm using Firebug and checking the "Net" panel...

Is there some automated way to give a domain name and request all pages + resources of the domain?

Update:

OK, I found I can use wget like so:

wget -p --no-cache -e robots=off -m -H -D cdn.domain.com,www.domain.com -o site1.log www.domain.com

options explained:

  • -p - download resources too (images, css, js, etc.)
  • --no-cache - fetch the real object, not the server's cached copy
  • -e robots=off - disregard robots and no-follow directions
  • -m - mirror site (follow links)
  • -H - span hosts (follow other domains too)
  • -D cdn.domain.com,www.domain.com - specify which domains to follow; otherwise wget will follow every link on the page
  • -o site1.log - log to file site1.log
  • -U "Mozilla/5.0" - optional: fake the user agent - useful if the server returns different content to different browsers
  • www.domain.com - the site to download

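Once the mirror finishes, the log written by -o can be scanned for any static resource that was fetched from somewhere other than the CDN. Below is a minimal Python sketch; the log name and CDN host match the command above, while the URL-matching regex and the extension list are assumptions of mine, so adjust them to your wget version and file types:

    import re

    CDN = "cdn.domain.com"  # the host every static resource should come from
    STATIC = (".js", ".css", ".png", ".jpg", ".jpeg", ".gif", ".ico", ".svg")

    # wget logs each request as a line containing the full URL
    # (an assumption about the log format; tweak the regex if needed)
    url_re = re.compile(r"https?://\S+")

    with open("site1.log", encoding="utf-8", errors="replace") as log:
        for line in log:
            for url in url_re.findall(line):
                host = url.split("/")[2]
                path = url.split("?")[0].lower()
                if path.endswith(STATIC) and host != CDN:
                    print("static resource not fetched from CDN:", url)
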
Enjoy!

Solution

The wget documentation has this bit in it:

Actually, to download a single page and all its requisites (even if they exist on separate websites), and make sure the lot displays properly locally, this author likes to use a few options in addition to ‘-p’:

      wget -E -H -k -K -p http://site/document

The key is the -H option, which means --span-hosts: go to foreign hosts when recursing. I don't know whether this applies to ordinary hyperlinks or only to page requisites, but you should try it out.

You can consider an alternate strategy. You don't need to download the resources to test that they are referenced from the CDN. You can just get the source code for the pages you're interested in (you can use wget, as you did, or curl, or something else) and either:

  • parse it using a library - which one depends on the language you're using for scripting. Check each <img />, <link /> and <script /> tag for CDN links (see the sketch after this list).
  • use regexes to check that the resource URLs contain the CDN domain. Parsing HTML with regexes is famously fragile, although in this limited case it might not be overly complicated.
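
For the library route, Python's standard library is already enough for a first pass. A hedged sketch, assuming the CDN host below and that the interesting URLs live in the src/href attributes of <img>, <script> and <link> (real pages may also use srcset, <source>, inline styles and so on):

    from html.parser import HTMLParser
    from urllib.parse import urlparse
    from urllib.request import urlopen

    CDN = "cdn.domain.com"  # assumed CDN host

    # which attribute carries the resource URL for each tag we check
    RESOURCE_ATTRS = {"img": "src", "script": "src", "link": "href"}

    class CDNChecker(HTMLParser):
        def __init__(self):
            super().__init__()
            self.offenders = []

        def handle_starttag(self, tag, attrs):
            attr = RESOURCE_ATTRS.get(tag)
            url = dict(attrs).get(attr) if attr else None
            # a relative URL (empty netloc) resolves against the page's own
            # host, so it is not served from the CDN either
            if url and urlparse(url).netloc != CDN:
                self.offenders.append((tag, url))

    page = "http://www.domain.com/"  # page under test
    checker = CDNChecker()
    checker.feed(urlopen(page).read().decode("utf-8", errors="replace"))
    for tag, url in checker.offenders:
        print("<%s> not on CDN: %s" % (tag, url))

Note that this flags every <link> href (canonical URLs, favicons, feeds and so on), so in practice you may want to restrict it to rel="stylesheet".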

You should also check all CSS files for url() references - these should point at the CDN as well (a sketch of this check follows). Depending on the logic of your application, you may need to check that the JavaScript code does not create any images that do not come from the CDN.
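
For the url() references a regex is usually good enough, because the syntax is far more regular than HTML. Another hedged sketch, with the same assumed CDN host and a locally saved stylesheet (data: URIs are skipped since they never hit the network):

    import re

    CDN = "cdn.domain.com"  # assumed CDN host

    # match url(...) with optional single or double quotes around the value
    url_ref = re.compile(r"""url\(\s*['"]?([^'")\s]+)['"]?\s*\)""", re.IGNORECASE)

    def non_cdn_refs(css_text):
        """Return url() references that do not point at the CDN."""
        ok_prefixes = ("http://" + CDN, "https://" + CDN, "//" + CDN)
        return [u for u in url_ref.findall(css_text)
                if not u.startswith(ok_prefixes) and not u.startswith("data:")]

    with open("styles.css", encoding="utf-8") as f:  # a stylesheet saved by wget
        for ref in non_cdn_refs(f.read()):
            print("url() not on CDN:", ref)

One caveat: a relative url() reference resolves against the host the stylesheet itself is served from, so if your CSS files already live on the CDN, relative references are fine and the sketch above over-reports.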

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow