Question

The goal

I want to mirror a website, such that I can host the static files anywhere (localhost, S3, etc.) and the URLs will appear just like the original to the end user.

The command

This is almost perfect for my needs (...but not quite):

wget --mirror -nH -np -p -k -E -e robots=off http://mysite

What this does do

  • --mirror : Recursively download the entire site
  • -p : Download all necessary page requisites
  • -k : Convert the URLs to relative paths so I can host them anywhere

What this doesn't do

  • Prevent duplicate downloads
  • Maintain (exactly) the same URL structure

The problem

Some files are being downloaded more than once, which results in both myfile.html and myfile.1.html. That wouldn't be so bad, except that when wget rewrites the hyperlinks it points them at the myfile.1.html version, which changes the URLs and therefore has SEO implications (Google will index ugly-looking URLs).

The -nc option would prevent this, but as of wget-v1.13, I cannot use -k and -nc at the same time. Details for this are here.

Help?!

I was hoping to use wget, but I am now considering another tool, such as httrack, although I don't have any experience with it yet.

Any ideas on how to achieve this (with wget, httrack or anything else) would be greatly appreciated!


Solution

httrack got me most of the way; the only URL mangling it did was to make links point to /folder/index.html instead of /folder/.

Neither httrack nor wget produced a perfect URL structure, so we ended up writing a small bash script that runs the crawler and then uses sed to clean up some of the URLs (stripping index.html from links, replacing bla.1.html with bla.html, etc.).
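The original script isn't shown here, so what follows is only a minimal sketch of what the sed clean-up step could look like. The function names clean_links and clean_tree are made up for illustration, and the two patterns assume links appear as double-quoted href attributes; real crawler output may need more patterns (for example, links with #fragments or single quotes).

```shell
#!/usr/bin/env bash
# Sketch only: clean_links and clean_tree are hypothetical helper names,
# and the patterns assume double-quoted href="..." attributes.

# Read HTML on stdin, write cleaned HTML to stdout.
clean_links() {
  sed -E \
    -e 's#(href="[^"]*/)index\.html"#\1"#g' \
    -e 's#(href="[^"]*)\.[0-9]+\.html"#\1.html"#g'
}
# First expression:  href="/folder/index.html" becomes href="/folder/"
# Second expression: href="bla.1.html"         becomes href="bla.html"

# Apply clean_links in place to every .html file under the directory in $1.
clean_tree() {
  find "$1" -type f -name '*.html' | while IFS= read -r f; do
    clean_links < "$f" > "$f.tmp" && mv "$f.tmp" "$f"
  done
}
```

After the crawler finishes you would run something like clean_tree ./mirror (the directory name is an assumption). Note that this only rewrites links; it does not delete or rename the duplicate bla.1.html files themselves, and a bare href="index.html" with no leading path is left alone.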

OTHER TIPS

wget description and help

According to this (and a quick experiment of my own) you should have no problems using -nc and -k options together to gather the pages you are after.

What will cause a problem is combining -N with -nc: the two are incompatible and do not work together at all, so you cannot compare files by timestamp and still no-clobber them. The --mirror option implicitly includes -N (it is currently equivalent to -r -N -l inf --no-remove-listing).

Rather than using --mirror, try replacing it with "-r -l inf", which enables recursive downloading to an infinite depth but still allows your other options to work.

An example, based on your original:

wget -r -l inf -k -nc -nH -p -E -e robots=off http://yoursite

Notes: I would suggest adding -w 5 --random-wait --limit-rate=200k to avoid DoSing the server and to be a little less rude, but that is obviously up to you.
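Putting the example command and the throttling flags together might look like the sketch below. The wrapper name mirror_politely is made up for illustration; the flags are exactly the ones mentioned in this answer. Whether -nc and -k are accepted together seems to depend on the wget version (the question reports they cannot be combined as of 1.13), so treat this as something to try rather than a guaranteed fix.

```shell
#!/usr/bin/env bash
# Sketch only: mirror_politely is a hypothetical wrapper name.
# $1 is the site to mirror, e.g. http://yoursite

mirror_politely() {
  # -r -l inf            recurse to unlimited depth (stands in for --mirror, minus -N)
  # -k -nc               convert links and skip files that already exist
  # -nH -p -E            no host directory, fetch page requisites, add .html extensions
  # -w 5 --random-wait   pause around 5 seconds, randomised, between requests
  # --limit-rate=200k    cap download bandwidth
  wget -r -l inf -k -nc -nH -p -E -e robots=off \
       -w 5 --random-wait --limit-rate=200k \
       "$1"
}
```

Usage would then be mirror_politely http://yoursite, run from the directory where the mirror should be written.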

Generally speaking, I try to avoid option groupings like --mirror because conflicts like this are harder to trace.

I know this is an answer to a very old question, but I think it should be addressed. wget is a new command for me, but so far it is proving invaluable, and I hope others feel the same.

Licensed under: CC-BY-SA with attribution