Question

I'm trying to build a lazy seq that pulls its data from AWS S3 as needed (via the amazonica library). I've got the following code, which almost does what I want but makes one more network call than needed: if more data is available, it always realizes one more recursive call.

Edit: Thanks, Alex, for pointing out that my println was in a place where it would be called even if the network call wasn't realized. The code now performs as desired, so that just leaves the question: is there a better way to do it?

;; Assumes amazonica's S3 namespace is loaded:
;; (require '[amazonica.aws.s3 :as s3])
(defn chunked-list-objects-seq
  "Returns a listing of objects in a bucket, with given prefix. These
are lazily chunked, to avoid unneeded network calls.
opts are :bucket-name :prefix :next-marker"
  [cred opts]
  (lazy-seq
   (let [response (s3/list-objects cred opts)
         summaries (:object-summaries response)
         buffer (chunk-buffer (count summaries))]
     ;; Runs only when this chunk is realized, i.e. when a request was made.
     (println "pulling from network")
     ;; Copy this page of results into a single chunk.
     (doseq [summary summaries]
       (chunk-append buffer summary))
     (chunk-cons
      (chunk buffer)
      ;; The recursive call returns an unrealized lazy-seq, so the next
      ;; request is made only if a consumer walks past this chunk.
      (when (:truncated? response)
        (chunked-list-objects-seq cred (assoc opts :next-marker (:next-marker response))))))))

The code above was adapted from "Clojure High Performance Programming", pg. 28 (custom chunking).

Calling it looks like this:

user> (time (pprint (count (take 990 (chunked-list-objects-seq cred {:bucket-name "bucket-name" :prefix "path-prefix/"})))))
=> pulling from network
   990
   "Elapsed time: 2009.723 msecs"

(AWS seems to like returning 1k chunks, when there are more than 1k items in a bucket)

There are certainly other ways to do this (an atom-and-future implementation comes to mind), but this seems to fit the interface of a seq best.

So basically, can this code be fixed to not make unnecessary network calls, and is this a good way to do this?

Solution

I think that making a chunked lazy sequence that fetches blocks of data over the network is a perfectly reasonable approach. The only caveat is that extra care is needed if the S3 client code relies on any dynamic bindings being set, because a lazy seq may be realized after the binding form that created it has exited.
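
To illustrate the dynamic-binding caveat, here is a minimal sketch that does not touch S3 at all. The var `*region*` and the functions `lazy-regions` and `lazy-regions-bound` are hypothetical names made up for this example; they only stand in for client code that reads a dynamic var while a lazy seq is being realized.

```clojure
;; Hypothetical dynamic var standing in for client configuration.
(def ^:dynamic *region* "default-region")

(defn lazy-regions
  "Hypothetical lazy 'request' that reads *region* when it is realized."
  []
  (lazy-seq [*region*]))

;; Created inside `binding` but realized after it exits:
;; the body sees the root value.
(def realized-late
  (first (binding [*region* "eu-west-1"] (lazy-regions))))
;; => "default-region"

;; Forcing realization inside the binding sees the bound value,
;; at the cost of giving up laziness.
(def realized-early
  (first (binding [*region* "eu-west-1"] (doall (lazy-regions)))))
;; => "eu-west-1"

(defn lazy-regions-bound
  "Same as lazy-regions, but bound-fn captures the caller's bindings
  when the seq is created, so late realization still sees them."
  []
  (let [read-region (bound-fn [] [*region*])]
    (lazy-seq (read-region))))

(def realized-late-bound
  (first (binding [*region* "eu-west-1"] (lazy-regions-bound))))
;; => "eu-west-1"
```

If the client really does depend on a dynamic var, capturing the bindings with `bound-fn` when the seq is created is one way to stay lazy; whether that is needed for amazonica depends on how the credentials and client are configured.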

Your initial code had the println outside the call to set up the lazy-seq to fetch the next block of data, so you were seeing the message printed regardless of whether the next block was actually fetched. Putting the println closer to the call to list-objects will give you a better idea of when the network request is made.
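
One way to check when a request is really made, without relying on println placement, is to count calls against a stub. The sketch below assumes a hypothetical `fake-list-objects` that imitates the shape of the response used in the question (`:object-summaries`, `:truncated?`, `:next-marker`) and serves 2500 integers in pages of 1000; `fetch-count` and `paged-seq` are likewise names invented for this example.

```clojure
;; Counts how many times the stubbed "network call" runs.
(def fetch-count (atom 0))

(defn fake-list-objects
  "Hypothetical stand-in for s3/list-objects: 2500 items, 1000 per page."
  [{:keys [next-marker] :or {next-marker 0}}]
  (swap! fetch-count inc)
  (let [end (min 2500 (+ next-marker 1000))]
    {:object-summaries (vec (range next-marker end))
     :truncated?       (< end 2500)
     :next-marker      end}))

(defn paged-seq
  "Same chunking structure as the question's code, using the stub."
  [opts]
  (lazy-seq
   (let [{:keys [object-summaries truncated? next-marker]} (fake-list-objects opts)
         buffer (chunk-buffer (count object-summaries))]
     (doseq [item object-summaries]
       (chunk-append buffer item))
     (chunk-cons (chunk buffer)
                 (when truncated?
                   (paged-seq (assoc opts :next-marker next-marker)))))))

;; Taking fewer items than one page triggers a single fetch.
(reset! fetch-count 0)
(count (take 990 (paged-seq {})))   ;; => 990
@fetch-count                        ;; => 1

;; Crossing the page boundary triggers exactly one more.
(reset! fetch-count 0)
(count (take 1001 (paged-seq {})))  ;; => 1001
@fetch-count                        ;; => 2
```

Because the recursive call sits inside its own lazy-seq, the counter only increases when a consumer walks past the end of the current chunk.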

Licensed under: CC-BY-SA with attribution