Mirroring a website and maintaining URL structure

Question 1

httrack got me most of the way; the only URL mangling it did was to make links point to /folder/index.html instead of /folder/.

Neither httrack nor wget produced a perfect URL structure, so we ended up writing a small bash script that runs the crawler and then uses sed to clean up some of the URLs (strip index.html from links, replace bla.1.html with bla.html, etc.).
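For illustration, the sed cleanup step could look roughly like the sketch below. This is not our actual script: the function names clean_links and clean_tree are made up here, the patterns only cover href/src attributes in double quotes, and it assumes the crawl has already finished. It rewrites links only; it does not rename any bla.1.html files on disk.

```shell
#!/usr/bin/env bash
# Sketch only: clean_links and clean_tree are hypothetical names,
# and the regexes are assumptions about how the mangled links look.

# Filter HTML on stdin and rewrite link targets:
#   .../folder/index.html  ->  .../folder/
#   bla.1.html             ->  bla.html
clean_links() {
  sed -E \
    -e 's#((href|src)="[^"]*/)index\.html"#\1"#g' \
    -e 's#((href|src)="[^"]*)\.1\.html"#\1.html"#g'
}

# Apply clean_links to every .html file under the given mirror directory.
clean_tree() {
  find "$1" -type f -name '*.html' -print0 |
    while IFS= read -r -d '' f; do
      # Write to a temp file first so a failed sed never truncates the page.
      clean_links < "$f" > "$f.tmp" && mv "$f.tmp" "$f"
    done
}
```

After the crawler has run, something like `clean_tree ./mirror` would process the whole tree (./mirror being a placeholder for wherever the crawler wrote its output).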

Question 2

wget description and help

According to the wget documentation linked above (and a quick experiment of my own), you should have no problems using the -nc and -k options together to gather the pages you are after.

What will cause a problem is combining -N with -nc: the two options are incompatible and do not work together at all, so you cannot compare files by timestamp and still no-clobber them. The --mirror option implies -N, so it conflicts with -nc as well.

Rather than using --mirror, try replacing it with "-r -l inf", which enables recursive downloading to an infinite depth but still allows your other options to work.

An example, based on your original:

wget -r -l inf -k -nc -nH -p -E -e robots=off http://yoursite

Note: I would suggest adding -w 5 --random-wait --limit-rate=200k to avoid DoSing the server and to be a little less rude, but that is obviously up to you.
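Putting the example command and the throttling options together might look like the sketch below. The wrapper name polite_mirror is just something I made up for illustration, and http://yoursite is the same placeholder as above.

```shell
#!/usr/bin/env bash
# Sketch only: polite_mirror is a hypothetical wrapper name.
# $1 is the site to mirror, e.g. the placeholder http://yoursite.
polite_mirror() {
  # -r -l inf        recurse to unlimited depth (stands in for --mirror, minus -N)
  # -k -nc           convert links; do not re-download existing files
  # -nH -p -E        no host directory; fetch page requisites; add .html extensions
  # -e robots=off    ignore robots.txt
  # -w 5 --random-wait --limit-rate=200k   throttle requests and bandwidth
  wget -r -l inf -k -nc -nH -p -E -e robots=off \
       -w 5 --random-wait --limit-rate=200k "$1"
}
```

Calling `polite_mirror http://yoursite` then runs the whole thing with a pause between requests and a bandwidth cap.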

Generally speaking, I try to avoid option groupings like --mirror because conflicts like this are harder to trace.

I know this is an answer to a very old question, but I think it should be addressed. wget is a new command for me, but so far it is proving invaluable, and I hope others feel the same.
