Question

I have a web directory where I store some config files. I'd like to use wget to pull those files down and maintain their current structure. For instance, the remote directory looks like:

http://mysite.com/configs/.vim/

.vim holds multiple files and directories. I want to replicate that on the client using wget. Can't seem to find the right combo of wget flags to get this done. Any ideas?

Solution

You have to pass the -np/--no-parent option to wget (in addition to -r/--recursive, of course); otherwise it will follow the link in the directory index back up to the parent directory. So the command would look like this:

wget --recursive --no-parent http://example.com/configs/.vim/

To avoid downloading the auto-generated index.html files, use the -R/--reject option:

wget -r -np -R "index.html*" http://example.com/configs/.vim/

OTHER TIPS

To download a directory recursively, rejecting index.html* files and saving it without the hostname and parent directories in the local path:

wget -r -nH --cut-dirs=2 --no-parent --reject="index.html*" http://mysite.com/dir1/dir2/data
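As an illustration of that mapping (the file name file.txt is made up): -nH drops the mysite.com directory and --cut-dirs=2 drops dir1/dir2, so a remote file dir1/dir2/data/file.txt is saved locally as data/file.txt. The same path rewrite can be mimicked in plain shell:

```shell
# Illustration only: mimic the path rewriting done by -nH --cut-dirs=2.
# Nothing is downloaded; the file name is hypothetical.
remote_path="dir1/dir2/data/file.txt"              # path after the hostname
saved_path=$(echo "$remote_path" | cut -d/ -f3-)   # --cut-dirs=2 drops the first 2 dirs
echo "$saved_path"                                 # data/file.txt
```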

For anyone else having similar issues: wget follows robots.txt, which might not allow you to grab the site. No worries, you can turn it off:

wget -e robots=off http://www.example.com/

http://www.gnu.org/software/wget/manual/html_node/Robot-Exclusion.html

You should use the -m (mirror) flag, since it takes care not to mess with timestamps and recurses indefinitely.

wget -m http://example.com/configs/.vim/

If you add the points mentioned by others in this thread, it would be:

wget -m -e robots=off --no-parent http://example.com/configs/.vim/

Here's the complete wget command that worked for me to download files from a server's directory (ignoring robots.txt):

wget -e robots=off --cut-dirs=3 --user-agent=Mozilla/5.0 --reject="index.html*" --no-parent --recursive --relative --level=1 --no-directories http://www.example.com/archive/example/5.3.0/

If --no-parent does not help, you can use the --include option.

Directory structure:

http://<host>/downloads/good
http://<host>/downloads/bad

And you want to download downloads/good but not the downloads/bad directory:

wget --include downloads/good --mirror --execute robots=off --no-host-directories --cut-dirs=1 --reject="index.html*" --continue http://<host>/downloads/good

wget -r http://mysite.com/configs/.vim/

works for me.

Perhaps you have a .wgetrc which is interfering with it?
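For example, here are some illustrative ~/.wgetrc settings (not from the original answer) that would change how a plain wget -r behaves; check your startup file for entries like these:

```ini
# Hypothetical ~/.wgetrc entries that can interfere with recursive fetches
reclevel = 1          # caps recursion depth at one level
no_parent = off       # allows wget to ascend to the parent directory
robots = on           # honor robots.txt (the default)
reject = index.html*  # silently skips files matching this pattern
```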

To fetch a directory recursively with username and password, use the following command:

wget -r --user=(put username here) --password='(put password here)' --no-parent http://example.com/

All you need are two flags: "-r" for recursion and "--no-parent" (or -np) so that wget does not ascend into the parent directory. Like this:

wget -r --no-parent http://example.com/configs/.vim/

That's it. It will download into the following local tree: ./example.com/configs/.vim. However, if you do not want the first two directories, add -nH (to drop the hostname directory) together with --cut-dirs=1 (--cut-dirs counts remote path components only and does not include the hostname, so here it cuts just configs), in the spirit of earlier replies:

wget -r --no-parent -nH --cut-dirs=1 http://example.com/configs/.vim/

And it will download your file tree only into ./.vim/

In fact, I got the first command in this answer straight from the wget manual; it has a very clean example towards the end of section 4.3.

You should be able to do it simply by adding -r:

wget -r http://stackoverflow.com/

Wget 1.18 may work better, e.g., I got bitten by a version 1.12 bug where...

wget --recursive (...)

...only retrieves index.html instead of all files.

The workaround was to notice some 301 redirects and try the new location; given the new URL, wget got all the files in the directory.

This version downloads recursively and doesn't create parent directories.

wgetod() {
    # Count the slashes in the URL's path (everything after the hostname).
    NSLASH="$(echo "$1" | perl -pe 's|.*://[^/]+(.*?)/?$|\1|' | grep -o / | wc -l)"
    # Cut all parent directories but keep the target directory itself.
    NCUT=$((NSLASH > 0 ? NSLASH-1 : 0))
    wget -r -nH --user-agent=Mozilla/5.0 --cut-dirs=$NCUT --no-parent --reject="index.html*" "$1"
}

Usage:

  1. Add to ~/.bashrc or paste into terminal
  2. wgetod "http://example.com/x/"
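The NCUT arithmetic above can be sanity-checked without running wget. Here is a small sketch (the helper name cut_dirs_for is invented) that reproduces it in plain shell: one less than the number of path components, so the target directory itself is kept:

```shell
# Reproduce wgetod's --cut-dirs calculation for a given URL (illustrative helper).
cut_dirs_for() {
    rest="${1#*://}"                 # strip the scheme, e.g. "http://"
    path="/${rest#*/}"               # strip the host, keep a leading slash
    path="${path%/}"                 # drop any trailing slash
    nslash=$(printf '%s' "$path" | grep -o / | wc -l)
    echo $((nslash > 0 ? nslash - 1 : 0))
}

cut_dirs_for "http://example.com/dir1/dir2/data"   # prints 2
```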
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow