Question

I am trying to parse a list of image URLs and get some basic information about each one before I actually commit to downloading it.

  1. Is the image there (solved with response.code?)
  2. Do I have the image already (want to look at type and size?)

My script will check a large list every day (about 1300 rows), and each row has 30-40 image URLs. My @photo_urls variable allows me to keep track of what I have downloaded already. I would really like to be able to use it as a hash (instead of an array, as in my example code) so that I can iterate through it later and do the actual downloading.

Right now my problem (besides being a Ruby newbie) is that Net::HTTP::Pipeline only accepts an array of Net::HTTPRequest objects. The documentation for net-http-pipeline indicates that response objects will come back in the same order as the corresponding request objects that went in. The problem is that I have no way to correlate the request to the response other than that order. However, I don't know how to get relative ordinal position inside a block. I assume I could just have a counter variable but how would I access a hash by ordinal position?

    Net::HTTP.start uri.host do |http|
      # Init HTTP requests hash
      requests = {}
      photo_urls.each do |photo_url|
        # make sure we don't process the same image again.
        hashed = Digest::SHA1.hexdigest(photo_url)
        next if @photo_urls.include? hashed
        @photo_urls << hashed
        # change user agent and store in hash
        my_uri = URI.parse(photo_url)
        request = Net::HTTP::Head.new(my_uri.path)
        request.initialize_http_header({"User-Agent" => "My Downloader"})
        requests[hashed] = request
      end
      # process requests (send array of values - ie. requests) in a pipeline.
      http.pipeline requests.values do |response|
        if response.code=="200"
          # anyway to reference the hash here so I can decide whether
          # I want to do anything later?
        end
      end
    end

Finally, if there is an easier way of doing this, please feel free to offer any suggestions.

Thanks!


Solution

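Make requests an array instead of a hash, and shift each request off the front of the array as its response comes in. Because responses arrive in the same order as the requests were sent, the request at the front of the array always belongs to the current response. Passing requests.dup to the pipeline keeps your own array separate from the one handed to the pipeline: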

    Net::HTTP.start uri.host do |http|
      # Init HTTP requests array
      requests = []
      photo_urls.each do |photo_url|
        # make sure we don't process the same image again.
        hashed = Digest::SHA1.hexdigest(photo_url)
        next if @photo_urls.include? hashed
        @photo_urls << hashed

        # change user agent and store in array
        my_uri = URI.parse(photo_url)
        request = Net::HTTP::Head.new(my_uri.path)
        request.initialize_http_header({"User-Agent" => "My Downloader"})
        requests << request
      end

      # process requests in a pipeline (a copy of the array is sent,
      # so our own array can be shifted independently).
      http.pipeline requests.dup do |response|
        # responses arrive in request order, so the first remaining
        # request is the one this response belongs to
        request = requests.shift

        if response.code=="200"
          # Do whatever checking with request
        end
      end
    end
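
If you also want the hash key back inside the block, one option is to queue the key together with each request and shift the whole entry. The following is only a rough sketch of that idea, not the library's own API: fake_pipeline is a hypothetical stand-in for http.pipeline so the example runs without a network, the example.com URLs and the header values it returns are made up, and request_uri is used instead of path on the assumption that you want to keep any query string. The type and size come from the standard Content-Type and Content-Length response headers, which a real server may or may not send for a HEAD request.

```ruby
require 'digest'
require 'net/http'
require 'uri'

# Hypothetical stand-in for http.pipeline: yields one response per request,
# in request order (the ordering guarantee the approach above relies on).
def fake_pipeline(requests)
  requests.each do |req|
    if req.path.include?('missing')
      yield Net::HTTPNotFound.new('1.1', '404', 'Not Found')
    else
      res = Net::HTTPOK.new('1.1', '200', 'OK')
      res['Content-Type'] = 'image/jpeg'   # made-up header values
      res['Content-Length'] = '1234'
      yield res
    end
  end
end

# Made-up URLs for illustration only.
photo_urls = %w[
  http://example.com/a.jpg
  http://example.com/missing.jpg
  http://example.com/b.jpg
]

# Queue of [hashed, url, request] entries, in the order the requests are sent.
pending = photo_urls.map do |url|
  request = Net::HTTP::Head.new(URI.parse(url).request_uri)
  request['User-Agent'] = 'My Downloader'
  [Digest::SHA1.hexdigest(url), url, request]
end

# Hash of images worth downloading later, keyed by the SHA1 of the URL.
to_download = {}

fake_pipeline(pending.map(&:last)) do |response|
  # The front of the queue is the entry that produced this response.
  hashed, url, _request = pending.shift
  next unless response.code == '200'

  to_download[hashed] = {
    url: url,
    type: response.content_type,   # from the Content-Type header
    size: response.content_length  # from the Content-Length header, or nil
  }
end
```

With a real connection you would pass the same array of requests to http.pipeline in place of fake_pipeline; the block body stays the same, and to_download can then be iterated later to do the actual downloads.
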
Licensed under: CC-BY-SA with attribution