Question

I am currently using OpenURI to download a file in Ruby. Unfortunately, it seems impossible to get the HTTP headers without downloading the full file:

open(base_url,
  :content_length_proc => lambda {|t|
    if t && 0 < t
      pbar = ProgressBar.create(:total => t)
  end
  },
  :progress_proc => lambda {|s|
    pbar.progress = s if pbar
  }) {|io|
    puts io.size
    puts io.meta['content-disposition']
  }

Running the code above shows that it first downloads the full file and only then prints the header I need.

Is there a way to get the headers before the full file is downloaded, so I can cancel the download if the headers are not what I expect them to be?

Was it helpful?

Solution 2

It seems what I wanted is not possible to archieve using OpenURI, at least not, as I said, without loading the whole file first.

I was able to do what I wanted using Net::HTTP's request_get

Here an example:

http.request_get('/largefile.jpg') {|response|
  if (response['content-length'] < max_length)
    response.read_body do |str|   # read body now
      # save to file
    end
  end
}

Note that this only works when using a block, doing it like:

response = http.request_get('/largefile.jpg')

the body will already be read.

OTHER TIPS

You can use Net::HTTP for this matter, for example:

require 'net/http'

http = Net::HTTP.start('stackoverflow.com')

resp = http.head('/')
resp.each { |k, v| puts "#{k}: #{v}" }
http.finish

Another example, this time getting the header of the wonderful book, Object Orient Programming With ANSI-C:

require 'net/http'

http = Net::HTTP.start('www.planetpdf.com')

resp = http.head('/codecuts/pdfs/ooc.pdf')
resp.each { |k, v| puts "#{k}: #{v}" }
http.finish

Rather than use Net::HTTP, which can be like digging a pool on the beach using a sand shovel, you can use a number of the HTTP clients for Ruby and clean up the code.

Here's a sample using HTTParty:

require 'httparty'

resp = HTTParty.head('http://example.org')
resp.headers
# => {"accept-ranges"=>["bytes"], "cache-control"=>["max-age=604800"], "content-type"=>["text/html"], "date"=>["Thu, 02 Mar 2017 18:52:42 GMT"], "etag"=>["\"359670651\""], "expires"=>["Thu, 09 Mar 2017 18:52:42 GMT"], "last-modified"=>["Fri, 09 Aug 2013 23:54:35 GMT"], "server"=>["ECS (oxr/83AB)"], "x-cache"=>["HIT"], "content-length"=>["1270"], "connection"=>["close"]}

At that point it's easy to check the size of the document:

resp.headers['content-length'] # => "1270"

Unfortunately, the HTTPd you're talking to might not know how big the content will be; In order to respond quickly servers don't necessarily calculate the size of dynamically generated output, which would take almost as long and be almost as CPU intensive as actually sending it, so relying on the "content-length" value might be buggy.

The issue with Net::HTTP is it won't automatically handle redirects, so then you have to add additional code. Granted, that code is supplied in the documentation, but the code keeps growing as you need to do more things, until you've ended up writing yet another http client (YAHC). So, avoid that and use an existing wheel.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top