Question

I want to download a lot of URLs in a script, but I do not want to save the ones that lead to HTTP errors.

As far as I can tell from the man pages, neither curl nor wget provides such functionality. Does anyone know of another downloader that does?


Solution

A one-liner I just set up for this very purpose:

(It only handles a single file, but it might be useful for others.)

A=$$; ( wget -q "http://foo.com/pipo.txt" -O "$A.d" && mv "$A.d" pipo.txt ) || ( rm -f "$A.d"; echo "Removing temp file" )

This will attempt to download the file from the remote host. If there is an error, the file is not kept; in all other cases it is kept and renamed.
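Since the question is about a whole list of URLs, the same temp-file idea can be put in a loop. A rough sketch, assuming a hypothetical urls.txt with one URL per line and that each download should keep its remote basename:

while IFS= read -r url; do
    tmp="$(mktemp)"
    # wget exits non-zero on HTTP errors, so the file is only kept on success
    if wget -q "$url" -O "$tmp"; then
        mv "$tmp" "$(basename "$url")"
    else
        rm -f "$tmp"
        echo "Skipped $url" >&2
    fi
done < urls.txt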

OTHER TIPS

I think the -f option to curl does what you want:

-f, --fail

(HTTP) Fail silently (no output at all) on server errors. This is mostly done to better enable scripts etc to better deal with failed attempts. In normal cases when an HTTP server fails to deliver a document, it returns an HTML document stating so (which often also describes why and more). This flag will prevent curl from outputting that and return error 22. [...]

However, if the response was actually a 301 or 302 redirect, that still gets saved, even if its destination would result in an error:

$ curl -fO http://google.com/aoeu
$ cat aoeu
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
<A HREF="http://www.google.com/aoeu">here</A>.
</BODY></HTML>

To follow the redirect to its dead end, also give the -L option:

-L, --location

(HTTP/HTTPS) If the server reports that the requested page has moved to a different location (indicated with a Location: header and a 3XX response code), this option will make curl redo the request on the new place. [...]
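For the original use case (many URLs), the two flags can be combined in a loop; a rough sketch, assuming a hypothetical urls.txt with one URL per line:

while IFS= read -r url; do
    # -f discards error responses, -L follows redirects, -O keeps the remote file name
    if ! curl -fsSLO "$url"; then
        echo "Failed: $url" >&2
    fi
done < urls.txt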

Ancient thread... I landed here looking for a solution and ended up writing some shell code to do it.

if [ "$(curl -s -w "%{http_code}" --compressed -o /tmp/something \
      http://example.com/my/url/)" = "200" ]; then
  echo "yay"; cp /tmp/something /path/to/destination/filename
fi

This will download the output to a temp file, and create/overwrite the output file only if the status was 200. My use case is slightly different: in my case the output takes more than 10 seconds to generate, and I did not want the destination file to sit empty for that duration.
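If you need that pattern more than once, it can be wrapped in a small function; a sketch along the same lines (the fetch_ok name, the temp-file handling, and the placeholder paths are mine):

fetch_ok() {
    local url=$1 dest=$2 tmp code
    tmp="$(mktemp)"
    # -w "%{http_code}" prints only the status code, since the body goes to the temp file
    code="$(curl -s -w "%{http_code}" --compressed -o "$tmp" "$url")"
    if [ "$code" = "200" ]; then
        mv "$tmp" "$dest"
    else
        rm -f "$tmp"
        return 1
    fi
}

fetch_ok http://example.com/my/url/ /path/to/destination/filename && echo "yay"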

I have a workaround to propose: it does download the file, but it also removes it if its size is 0 (which happens if a 404 occurs).

wget -O <filename> <url/to/file>
if [[ $(du <filename> | cut -f 1) == 0 ]]; then
    rm <filename>
fi

It works for zsh but you can adapt it for other shells.

But it only saves the file in the first place if you provide the -O option.
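As a side note, the same zero-size check can be written with the shell's -s test, which is true only for a non-empty file and avoids parsing du output; a small sketch with placeholder variable names:

wget -O "$filename" "$url"
if [ ! -s "$filename" ]; then
    # remove the file if it is missing or empty
    rm -f "$filename"
fi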

NOTE: I am aware that this is an older question, but I believe I have found a better solution for those using wget than any of the above answers provide.

wget -q "$URL" 2>/dev/null

This will save the target file to the local directory if and only if the HTTP status code is within the 200 range (OK).

Additionally, if you wanted to do something like print out an error whenever the request was met with an error, you could check the wget exit code for non-zero values like so:

wget -q "$URL" 2>/dev/null
if [ $? -ne 0 ]; then
    echo "There was an error!"
fi

I hope this is helpful to someone out there facing the same issues I was.

Update: I just put this into a more script-able form for my own project, and thought I'd share:

function dl {
    pushd . > /dev/null
    cd "$(dirname "$1")"
    # fetch the file into the matching local directory, quietly
    wget -q "$BASE_URL/$1" 2> /dev/null
    if [ $? -ne 0 ]; then
        echo ">> ERROR could not download file \"$1\"" 1>&2
        exit 1
    fi
    popd > /dev/null
}
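For what it's worth, a hypothetical usage sketch of that function; BASE_URL, the relative paths, and the matching local directories are all illustrative assumptions:

BASE_URL="http://example.com/files"
mkdir -p docs images
dl docs/readme.txt
dl images/logo.png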

You can download the file without saving it by using the "-O -" option, as in:

wget -O - http://jagor.srce.hr/

You can get more information at http://www.gnu.org/software/wget/manual/wget.html#Advanced-Usage
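That streams the document to standard output instead of a file, so nothing is left on disk; for example, you could pipe it straight into another command (the grep here is just for illustration):

wget -q -O - http://jagor.srce.hr/ | grep -i "<title>"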

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow