Unfortunately for my use case, there's not a way to get pre-calculated checksums using standard HTTP headers or via the Net::HTTP request.
Solutions:
If you're in control of the server, you can add arbitrary headers, such as with Nginx or Apache.
Alternatively, one could create and expose a structured dictionary file with key/value pairs for files/checksums, such as the following (cursory) example in JSON:
{ "md5-files": [
{"file1" : "60b725f10c9c85c70d97880dfe8191b3"},
{"file2" : "18ac6fe7ca693bb1767982e2eb3bbd0d"}
]}
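A client could consume a file like this with a few lines of Ruby. This is only a rough sketch: `MANIFEST_JSON`, `parse_manifest` and `up_to_date?` are names I made up for illustration, and it assumes the manifest keeps exactly the shape shown above.

```ruby
require 'json'
require 'digest/md5'

# Hypothetical manifest text, same shape as the JSON example above.
MANIFEST_JSON = <<~JSON
  { "md5-files": [
    {"file1" : "60b725f10c9c85c70d97880dfe8191b3"},
    {"file2" : "18ac6fe7ca693bb1767982e2eb3bbd0d"}
  ]}
JSON

# Flatten the array of single-pair objects into one { filename => md5 } hash.
def parse_manifest(json_text)
  JSON.parse(json_text).fetch('md5-files').reduce({}) { |acc, pair| acc.merge(pair) }
end

# True when local_data hashes to the checksum the manifest lists for name.
# An unlisted name yields nil on the left side, so it is never "up to date".
def up_to_date?(manifest, name, local_data)
  manifest[name] == Digest::MD5.hexdigest(local_data)
end

manifest = parse_manifest(MANIFEST_JSON)
puts manifest.keys.inspect
```

In practice you would fetch the manifest over HTTP first and pass something like `File.binread(path)` as `local_data`.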
If one were going to mirror the same file across many servers, it might be worth building such a structured file locally and using only one server to signal that the file has changed remotely (e.g. master-download-server-1 downloads the file from http://example.org/file1, compares it to its local version, then updates the checksum file). This file could then be parsed by slave-download-server1 and slave-download-server2 to determine whether they should send requests to example.org (or to master-download-server-1 itself).
Finally, as I download often from Amazon's S3, I went with the only option that I can use while acting as a client-only service: relying on the etag returned in the headers. Unfortunately, the documentation for this is not great, but here's a rough snippet of my approach:
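For the mirroring case, the decision each downstream server makes boils down to comparing two filename/checksum maps. A minimal sketch, assuming both the server's own checksums and the published ones have already been loaded into plain hashes (`changed_files` is a hypothetical helper, not part of any library):

```ruby
# Return the names whose checksum in `published` is missing from, or
# different to, the checksum this server recorded in `local`.
def changed_files(local, published)
  published.reject { |name, md5| local[name] == md5 }.keys
end

local     = { 'file1' => '60b725f10c9c85c70d97880dfe8191b3' }
published = { 'file1' => '60b725f10c9c85c70d97880dfe8191b3',
              'file2' => '18ac6fe7ca693bb1767982e2eb3bbd0d' }

# Only file2 is new here, so only file2 would need to be requested.
puts changed_files(local, published).inspect
```

Files that exist locally but have vanished from the published list are ignored here; whether to delete them is a separate policy decision.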
...
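# I actually call my own encryption-helper and filename-parsing methods,
# but this is meta-code for the sake of example:
require 'digest/md5'
require 'net/http'
require 'uri'

def example_file_getter(uri, docroot, file)
  # Assumption for this example only: the download target is docroot/file.
  destination = File.join(docroot, file)
  # MD5 of the local copy, or nil if there is no local copy yet.
  checksum = File.exist?(file) ? Digest::MD5.file(file).hexdigest : nil
  uri = URI.parse(uri)
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = (uri.scheme == 'https')
  # HEAD returns the headers (including the etag) without downloading the body.
  request = Net::HTTP::Head.new(uri.request_uri)
  response = http.request(request)
  # The etag arrives wrapped in double quotes. Unlike gsub!, delete never
  # returns nil when there is nothing to strip.
  etag = response['etag']&.delete('"')
  file_existed = !etag.nil? && etag == checksum
  if !File.exist?(destination) && !file_existed
    ...actually fetch the file
  end
  ...
end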
[again, meta-code; this is a summary of the important bits that are relevant to my original question]
Again, the etag documentation is not great and I fully expect Amazon to change this without warning at some point. From what I've pieced together from various forum responses (!!) from Amazon staffers, the general algorithm for the tag is as follows:
- If the file is under 5GB and was uploaded in a single (non-'multipart') request, the etag is likely the md5 of the uploaded file.
- If the file is over 5GB or was uploaded via 'multipart', the etag is not the md5 of the whole file. It takes the form md5-#, where # is the number of parts uploaded (e.g. a file uploaded in 3 parts would look like 18ac6fe7ca693bb1767982e2eb3bbd0d-3 in the header), and the md5 portion appears to be a hash computed over the md5 digests of the individual parts rather than over the file contents.
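To make the two cases concrete, here is a rough sketch that tells the two etag shapes apart and recomputes a multipart-style tag locally. Big caveat: the multipart formula below (md5 of the concatenated binary md5 digests of each part, plus "-" and the part count) is the commonly reported behaviour, not something I can point to a firm guarantee for, and `classify_etag` / `multipart_etag` are hypothetical helper names.

```ruby
require 'digest/md5'
require 'stringio'

# Classify an etag header value. Returns [:md5, hex] for a plain 32-hex-digit
# tag, [:multipart, hex, part_count] for the "md5-#" form, else [:unknown, tag].
def classify_etag(raw)
  tag = raw.to_s.delete('"')
  case tag
  when /\A(\h{32})\z/        then [:md5, $1.downcase]
  when /\A(\h{32})-(\d+)\z/  then [:multipart, $1.downcase, $2.to_i]
  else                            [:unknown, tag]
  end
end

# ASSUMPTION (observed behaviour, not a documented contract): the multipart
# tag is the md5 of the concatenated *binary* md5 digests of every part,
# followed by "-<number of parts>". part_size must be a positive integer.
def multipart_etag(io, part_size)
  digests = []
  while (chunk = io.read(part_size))
    digests << Digest::MD5.digest(chunk)
  end
  "#{Digest::MD5.hexdigest(digests.join)}-#{digests.size}"
end

# Ten bytes read in 4-byte parts gives three parts, so the tag ends in "-3".
puts multipart_etag(StringIO.new('abcdefghij'), 4)
```

For a real file you would pass `File.open(path, 'rb')` instead of a `StringIO`. The result only matches the remote tag if `part_size` is the same part size the uploader used, which a client generally has to guess or find out some other way.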
Not perfect, but if your remote hosts follow a predictable pattern, inspect the headers and hope for the best.