Question

I have a script that looks up the GeoIP locations of various IPs. It runs daily, and I expect to have around 50,000 IPs to look up.

I already have a GeoIP system set up; I'd just like to avoid running wget 50,000 times per report.

What I was thinking is that there must be some way to have wget open a connection to the URL once and then pass it the IPs, so it doesn't have to re-establish the connection for every lookup.

Any help will be much appreciated.

Solution

If you give wget several addresses at once, and consecutive addresses belong to the same HTTP/1.1 server (one that supports Connection: keep-alive), wget will re-use the already-established connection.
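For instance, assuming a hypothetical lookup URL on your GeoIP server, both of these requests would travel over a single keep-alive connection:

wget "http://geoip.example.com/lookup?ip=192.0.2.1" "http://geoip.example.com/lookup?ip=192.0.2.2"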

If there are too many addresses to list on the command line, you can write them to a file and use the -i/--input-file= option (and, per UNIX tradition, -i-/--input-file=- reads standard input).
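A minimal sketch of that, again with a hypothetical lookup URL: urls.txt holds one URL per line, and the combined output is written to results.txt:

# urls.txt:
# http://geoip.example.com/lookup?ip=192.0.2.1
# http://geoip.example.com/lookup?ip=192.0.2.2

wget -q -O - -i urls.txt > results.txt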

There is, however, no way to preserve a connection across different wget invocations.

OTHER TIPS

You could also write a threaded Ruby script to run wget on multiple input files simultaneously to speed the process up. So if you have 5 files containing 10,000 addresses each, you could use this script:

#!/usr/bin/ruby

threads = []

# Start one wget per input file; each wget keeps a single connection open
# for all of the URLs listed in that file.
ARGV.each do |filename|
  threads << Thread.new(filename) do |file|
    # Pass the arguments separately to avoid any shell quoting issues.
    system("wget", "-i", file)
  end
end

# Wait for all of the downloads to finish.
threads.each { |thread| thread.join }

Each of these threads uses one connection to fetch all of the addresses in its file, so the following command needs only 5 connections to the server to cover all 50,000 lookups.

./fetch.rb "list1.txt" "list2.txt" "list3.txt" "list4.txt" "list5.txt"

You could also write a small program (in Java or C or whatever), provided the server supports it, that sends the whole list of IPs in a single POST request and has the server return the location data for all of them in one response. That shouldn't be too slow either.
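As a rough illustration, here is a minimal Ruby sketch of that idea, assuming a hypothetical bulk-lookup endpoint on the GeoIP server that accepts a newline-separated list of IPs and returns all of the location data in one response:

#!/usr/bin/ruby
require "net/http"
require "uri"

# Hypothetical bulk endpoint -- the GeoIP server would have to support this.
uri = URI("http://geoip.example.com/bulk-lookup")

# Read the newline-separated list of IPs from the file named on the command line.
ips = File.read(ARGV[0])

# One POST carries the whole list; one response carries all the results.
response = Net::HTTP.post(uri, ips, "Content-Type" => "text/plain")
puts response.body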

Licensed under: CC-BY-SA with attribution