Question

I was under the impression that lazy seqs were always chunked.

=> (take 1 (map #(do (print \.) %) (range)))
(................................0)

As expected, 32 dots are printed, because the lazy seq returned by range is chunked into 32-element chunks. However, when I try this with my own function get-rss-feeds instead of range, the lazy seq is no longer chunked:

=> (take 1 (map #(do (print \.) %) (get-rss-feeds r)))
(."http://wholehealthsource.blogspot.com/feeds/posts/default")

Only one dot is printed, so I guess the lazy-seq returned by get-rss-feeds is not chunked. Indeed:

=> (chunked-seq? (seq (range)))
true

=> (chunked-seq? (seq (get-rss-feeds r)))
false

Here is the source for get-rss-feeds:

(defn get-rss-feeds
  "returns a lazy seq of urls of all feeds; takes an html-resource from the enlive library"
  [hr]
  (map #(:href (:attrs %))
       (filter #(rss-feed? (:type (:attrs %))) (html/select hr [:link]))))

So it appears that chunkiness depends on how the lazy seq is produced. I peeked at the source for the function range and there are hints of it being implemented in a "chunky" manner. So I'm a bit confused as to how this works. Can someone please clarify?


Here's why I need to know.

I have the following code: (get-rss-entry (get-rss-feeds h-res) url)

The call to get-rss-feeds returns a lazy sequence of URLs of feeds that I need to examine.

The call to get-rss-entry looks for a particular entry (whose :link field matches the second argument of get-rss-entry). It examines the lazy sequence returned by get-rss-feeds. Evaluating each item requires an http request across the network to fetch a new rss feed. To minimize the number of http requests it's important to examine the sequence one-by-one and stop as soon as there is a match.

Here is the code:

(defn get-rss-entry
  [feeds url]
  (ffirst (drop-while empty? (map #(entry-with-url % url) feeds))))

entry-with-url returns a lazy sequence of matches or an empty sequence if there is no match.

I tested this and it seems to work correctly (evaluating one feed url at a time). But I am worried that somewhere, somehow it will start behaving in a "chunky" way and it will start evaluating 32 feeds at a time. I know there is a way to avoid chunky behavior as discussed here, but it doesn't seem to even be required in this case.

Am I using lazy seq non-idiomatically? Would loop/recur be a better option?


Solution

Depending on the vagaries of chunking seems unwise, as you mention above. Explicitly "un-chunking" in the cases where you really need the seq not to be chunked is also wise, because then, if at some other point your code changes in a way that chunkifies it, things won't break. On another note, if you need the actions to be sequential, agents are a great tool: you could send the download functions to an agent, and they will then be run one at a time, and only once, regardless of how you evaluate the sequence. At some point you may want to pmap your sequence, and then even un-chunking will not work, though using an atom will continue to work correctly.
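
For illustration, here is a minimal sketch of that agent approach (the names feed-results and fetch-feed are my own placeholders, not from the question's code; send-off is used because the action blocks on I/O):

(def feed-results (agent []))

(defn fetch-feed
  "Hypothetical stand-in for the blocking HTTP request that downloads
  and parses a single feed URL."
  [url]
  url)

(defn queue-feed-download
  "Queues the download of url on the agent. Actions sent to a single
  agent run one at a time, in the order they were sent, so downloads
  stay sequential no matter how the calling code is evaluated."
  [url]
  (send-off feed-results
            (fn [results]
              (conj results (fetch-feed url)))))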

OTHER TIPS

You are right to be concerned. Your get-rss-entry will indeed call entry-with-url more than strictly necessary if the feeds parameter is a collection that returns chunked seqs. For example, if feeds is a vector, map will operate on whole chunks at a time.
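
A quick REPL check (my illustration, not part of the original answer) makes the effect visible: with a vector as the source, map realizes a whole 32-element chunk even though only the first element is requested.

(first (map #(do (print \.) %) (vec (range 100))))
;; ................................
;; => 0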

This problem is addressed directly in Fogus and Houser's The Joy of Clojure, with the function seq1 defined in chapter 12:

(defn seq1 [s]
  (lazy-seq
    (when-let [[x] (seq s)]
      (cons x (seq1 (rest s)))))) 

You could use this right where you know you want the most laziness possible, right before you call entry-with-url:

(defn get-rss-entry
  [feeds url]
  (ffirst (drop-while empty? (map #(entry-with-url % url) (seq1 feeds)))))
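
As a quick sanity check (my own REPL experiment, not from the book), wrapping a chunked source such as (range) in seq1 brings evaluation back to one element at a time, so the expression from the question now prints a single dot:

(take 1 (map #(do (print \.) %) (seq1 (range))))
;; => (.0)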

Lazy seqs are not always chunked - it depends on how they are produced.

For example, the lazy seq produced by this function is not chunked:

(defn integers-from [n]
  (lazy-seq (cons n (do (print \.) (integers-from (inc n))))))

(take 3 (integers-from 3))
=> (..3 .4 5)

But many other Clojure built-in functions do produce chunked seqs for performance reasons (e.g. range).

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow