Question

Evening,

I am running a lot of wget commands in parallel using xargs:

cat urls.txt | xargs -n 1 -P 10 wget -q -t 2 --timeout 10 --dns-timeout 10 --connect-timeout 10 --read-timeout 20

However, once the file has been parsed, some of the wget instances 'hang.' I can still see them in system monitor, and it can take about 2 minutes for them all to complete.

Is there any way I can specify that an instance should be killed after 10 seconds? I can re-download any URLs that failed later.

In system monitor, the wget instances are shown as sk_wait_data when they hang. xargs is there as 'do_wait,' but wget seems to be the issue, as once I kill them, my script continues.


Solution

I believe this should do it:

wget -v -t 2 --timeout 10

According to the docs:

  • --timeout: Set the network timeout to seconds seconds. This is equivalent to specifying --dns-timeout, --connect-timeout, and --read-timeout, all at the same time.

Check the verbose output too; it will show more of what wget is doing.
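Applied to the original xargs pipeline, that would look roughly like this (a sketch, untested). Since --timeout already covers the DNS, connect, and read timeouts, the separate flags can be dropped; note too that in the original command --read-timeout 20 appears after --timeout 10, so the effective read timeout there was probably still 20 seconds:

cat urls.txt | xargs -n 1 -P 10 wget -q -t 2 --timeout 10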

Also, you can try:

timeout 10 wget -v -t 2
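To kill a hung instance outright, you can wrap each wget in timeout inside the xargs pipeline, for example (a sketch; the -k 5 is optional and sends SIGKILL if wget ignores the initial signal):

cat urls.txt | xargs -n 1 -P 10 timeout -k 5 10 wget -q -t 2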

Or you can do what timeout does internally:

( cmdpid=$BASHPID; (sleep 10; kill $cmdpid) & exec wget -v -t 2 )

(As seen in: BASH FAQ entry #68: "How do I run a command, and have it abort (timeout) after N seconds?")
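To use that trick with xargs you have to go through an explicit shell. A rough sketch, assuming bash and one URL per line in urls.txt:

cat urls.txt | xargs -P 10 -I{} bash -c '( cmdpid=$BASHPID; (sleep 10; kill $cmdpid 2>/dev/null) & exec wget -q -t 2 "$1" )' _ {}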

OTHER TIPS

GNU Parallel can download in parallel, and retry the process after a timeout:

cat urls.txt | parallel -j10 --timeout 10 --retries 3 wget -q -t 2 
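Since you said you can re-download failed URLs later, a job log may help as well (a sketch; --joblog and --retry-failed are my additions and assume a reasonably recent version of GNU Parallel):

cat urls.txt | parallel -j10 --timeout 10 --retries 3 --joblog wget.log wget -q -t 2

Then, to re-run only the jobs that failed or timed out:

parallel --retry-failed --joblog wget.log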

If the time it takes to fetch a URL varies (e.g. due to a faster internet connection), you can let GNU Parallel figure out the timeout:

cat urls.txt | parallel -j10 --timeout 1000% --retries 3 wget -q -t 2 

This will make GNU Parallel record the median time for a successful job and set the timeout dynamically to 10 times that.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow