Question

Without being able to control or add headers server-side, is it possible to compare a local checksum against a remote file without downloading the entire file and comparing checksums using Ruby and Net::HTTP?

I'm populating a disk with files using a class I've written on top of Net::HTTP, and I'd like to save bandwidth by comparing the remote file against a SHA256 sum of my local file; I only want to download a remote file when my local copy doesn't match the remote version.

Here are my assumptions:

  • The filenames may be the same, but the contents may differ.

  • The 'Last-modified' date in the HTTP headers is not a good indication of a change - a cp /dir_a/file1.tar /dir_b/file2.tar results in identical checksums, but differing 'Last-modified' times.

  • HTTP header Etags are not a good indicator: http://example.org/file1.tar and http://example.iana.org//file1.tar may have different Etags for the same file.

  • HTTP header Etags are not entirely standard -- while Amazon S3 uses md5sums to generate its Etags, other hosts may not. This makes local generation of this tagging value difficult.

  • Maintaining a hash/dictionary of hostname-to-Etag implementations is unwieldy and a bad approach.

While I'm relatively certain that the server-side software would have to provide a facility for doing a file/tag/checksum comparison to accomplish this goal (e.g. a checksum field in the header or a separate look-up file), I would like confirmation of my assumptions before abandoning this pursuit. I've left out my existing code to avoid distraction, as I'm looking for how to approach the implementation.

Solution

Unfortunately for my use case, there's not a way to get pre-calculated checksums using standard HTTP headers or via the Net::HTTP request.

Solutions:

If you're in control of the server, you can add arbitrary headers, such as with Nginx or Apache.
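
As a rough illustration of the client side of that option, here is a minimal sketch that assumes the server has been configured to send a SHA256 hex digest in a response header. The header name `X-Checksum-Sha256` and the method names `remote_checksum` and `needs_download?` are hypothetical, chosen only for this example.

```ruby
require 'digest'
require 'net/http'
require 'uri'

# Hypothetical header name; use whatever your server is configured to send.
CHECKSUM_HEADER = 'X-Checksum-Sha256'

# Issue a HEAD request so that only headers are transferred, and return
# the checksum header's value (nil if the server did not send it).
def remote_checksum(url)
  uri = URI.parse(url)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.head(uri.request_uri)[CHECKSUM_HEADER]
  end
end

# True when the local file is missing or its SHA256 differs from the remote
# value. A nil remote value (header absent) is treated as "download".
def needs_download?(local_path, remote_sum)
  return true if remote_sum.nil? || !File.exist?(local_path)
  Digest::SHA256.file(local_path).hexdigest != remote_sum.strip.downcase
end
```

A caller would then do something like `fetch(url) if needs_download?(path, remote_checksum(url))`, where `fetch` stands in for whatever download method you already have.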

Alternatively, one could create and expose a structured dictionary file with key/value pairs for files/checksums, such as the following (cursory) example in JSON:

{ "md5-files": [
    {"file1" : "60b725f10c9c85c70d97880dfe8191b3"},
    {"file2" : "18ac6fe7ca693bb1767982e2eb3bbd0d"}
]}
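
A client could consume a file shaped like the example above along these lines. This is only a sketch: the method names `parse_manifest` and `stale_files` are made up for illustration, and it assumes the manifest text has already been retrieved (for instance with `Net::HTTP.get`).

```ruby
require 'digest'
require 'json'

# Flatten the manifest's list of single-pair objects into one
# { filename => md5 } hash.
def parse_manifest(json_text)
  JSON.parse(json_text).fetch('md5-files').reduce({}, :merge)
end

# Return the names whose local copy (under local_dir) is missing or has
# an MD5 that differs from the one listed in the manifest.
def stale_files(manifest, local_dir)
  manifest.reject do |name, md5|
    path = File.join(local_dir, name)
    File.exist?(path) && Digest::MD5.file(path).hexdigest == md5.downcase
  end.keys
end
```

Only the names returned by `stale_files` would need to be requested, so the cost of a check is one small manifest download rather than one request per file.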

If one were going to mirror the same file on a plethora of servers, it might be worth building such a structured file locally and using only one server to signal that the file has changed remotely (e.g. master-download-server-1 downloads the file from http://example.org/file1, compares it to its local version, then updates the file). This file could then be parsed by slave-download-server1 and slave-download-server2 to determine whether they should send requests to example.org (or to master-download-server-1 itself).

Finally, as I download often from Amazon's S3, I went with the only option that I can use while acting as a client-only service: relying on the etag returned in the headers. Unfortunately, the documentation for this is not great, but here's a rough snippet of my approach:

...
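require 'digest'
require 'net/http'
require 'uri'

# I actually call my own encryption-helper and filename-parsing methods,
# but this is meta-code for the sake of example:
def example_file_getter(uri, docroot, file)
    # Assumption: the local copy lives at docroot/file.
    destination = File.join(docroot, file)
    checksum = File.exist?(destination) ? Digest::MD5.file(destination).hexdigest : nil

    uri = URI.parse(uri)
    http = Net::HTTP.new(uri.host, uri.port)
    http.use_ssl = (uri.scheme == 'https')
    # HEAD returns only the headers, so the body isn't downloaded just to read the etag.
    request = Net::HTTP::Head.new(uri.request_uri)
    response = http.request(request)

    # Strip the surrounding quotes. Non-bang gsub is used because gsub!
    # returns nil when nothing is replaced.
    etag = response['etag']&.gsub('"', '')

    # The local copy counts as current only if it exists and matches the etag.
    file_existed = !checksum.nil? && etag == checksum

    unless file_existed
        # ...actually fetch the file
    end
end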
...

[again, meta-code; this is a summary of the important bits that are relevant to my original question]

Again, the etag documentation is not great and I fully expect Amazon to change this without warning at some point. From what I've pieced together from various forum responses (!!) from Amazon staffers, the general algorithm for the tag is as follows:

  1. If the file is smaller than 5GB and was uploaded in a single request (i.e. not 'multipart'), the etag is likely the md5 of the uploaded file.
  2. If the file is larger than 5GB or was uploaded via 'multipart', the etag is not the md5 of the whole file. It appears to be an md5 computed over the md5 digests of the individual parts, written as md5-#, where # is the number of parts (e.g. a file uploaded in 3 parts would look like 18ac6fe7ca693bb1767982e2eb3bbd0d-3 in the header).
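
To compare a local file against either form of etag, something like the following could work. It is a sketch resting on assumptions, not documented behaviour: for the multipart case it uses the commonly reported reconstruction (md5 of the concatenated binary md5 digests of each part, followed by a dash and the part count), and it assumes you know the part size the uploader used, which the headers do not reveal. `DEFAULT_PART_SIZE`, `expected_etag`, and `etag_matches?` are names invented for this example.

```ruby
require 'digest'

# Assumed part size; it must match whatever the uploader actually used.
DEFAULT_PART_SIZE = 8 * 1024 * 1024

# Compute the etag a local file would be expected to have: a plain md5 for a
# single-request upload, or md5-of-part-md5s plus "-<parts>" for multipart.
def expected_etag(path, multipart: false, part_size: DEFAULT_PART_SIZE)
  return Digest::MD5.file(path).hexdigest unless multipart

  part_digests = []
  File.open(path, 'rb') do |f|
    # Read the file one part at a time; read returns nil at end of file.
    while (chunk = f.read(part_size))
      part_digests << Digest::MD5.digest(chunk)
    end
  end
  "#{Digest::MD5.hexdigest(part_digests.join)}-#{part_digests.size}"
end

# Compare against a raw etag header value; a dash in the etag is taken
# to mean the object was uploaded in multiple parts.
def etag_matches?(path, remote_etag, part_size: DEFAULT_PART_SIZE)
  etag = remote_etag.delete('"')
  etag == expected_etag(path, multipart: etag.include?('-'), part_size: part_size)
end
```

If the part size guess is wrong, the multipart comparison simply fails and the file is downloaded again, so a wrong guess costs bandwidth rather than correctness.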

Not perfect, but if your remote hosts follow a predictable pattern, inspect the headers and hope for the best.

Licensed under: CC-BY-SA with attribution