How to get HTTP headers before downloading with Ruby's OpenUri

Question 1

It seems what I wanted is not possible to archieve using OpenURI, at least not, as I said, without loading the whole file first.

I was able to do what I wanted using Net::HTTP's request_get

Here an example:

http.request_get('/largefile.jpg') {|response|
  if (response['content-length'] < max_length)
    response.read_body do |str|   # read body now
      # save to file
    end
  end
}

Note that this only works when using a block, doing it like:

response = http.request_get('/largefile.jpg')

the body will already be read.

Question 2

You can use Net::HTTP for this matter, for example:

require 'net/http'

http = Net::HTTP.start('stackoverflow.com')

resp = http.head('/')
resp.each { |k, v| puts "#{k}: #{v}" }
http.finish

Another example, this time getting the header of the wonderful book, Object Orient Programming With ANSI-C:

require 'net/http'

http = Net::HTTP.start('www.planetpdf.com')

resp = http.head('/codecuts/pdfs/ooc.pdf')
resp.each { |k, v| puts "#{k}: #{v}" }
http.finish

Question 3

Rather than use Net::HTTP, which can be like digging a pool on the beach using a sand shovel, you can use a number of the HTTP clients for Ruby and clean up the code.

Here's a sample using HTTParty:

require 'httparty'

resp = HTTParty.head('http://example.org')
resp.headers
# => {"accept-ranges"=>["bytes"], "cache-control"=>["max-age=604800"], "content-type"=>["text/html"], "date"=>["Thu, 02 Mar 2017 18:52:42 GMT"], "etag"=>["\"359670651\""], "expires"=>["Thu, 09 Mar 2017 18:52:42 GMT"], "last-modified"=>["Fri, 09 Aug 2013 23:54:35 GMT"], "server"=>["ECS (oxr/83AB)"], "x-cache"=>["HIT"], "content-length"=>["1270"], "connection"=>["close"]}

At that point it's easy to check the size of the document:

resp.headers['content-length'] # => "1270"

Unfortunately, the HTTPd you're talking to might not know how big the content will be; In order to respond quickly servers don't necessarily calculate the size of dynamically generated output, which would take almost as long and be almost as CPU intensive as actually sending it, so relying on the "content-length" value might be buggy.

The issue with Net::HTTP is it won't automatically handle redirects, so then you have to add additional code. Granted, that code is supplied in the documentation, but the code keeps growing as you need to do more things, until you've ended up writing yet another http client (YAHC). So, avoid that and use an existing wheel.