Question

I implemented a function that returns the n-grams of a given input collection as a lazy seq.

(defn gen-ngrams
  [n coll]
  (if (>= (count coll) n)
    (lazy-seq (cons (take n coll) (gen-ngrams n (rest coll))))))

When I call this function with larger input collections, I would expect to see a linear increase in execution time. However, the timing I observe is worse than that:

user> (time (count (gen-ngrams 3 (take 1000 corpus))))
"Elapsed time: 59.426 msecs"
998
user> (time (count (gen-ngrams 3 (take 10000 corpus))))
"Elapsed time: 5863.971 msecs"
9998
user> (time (count (gen-ngrams 3 (take 20000 corpus))))
"Elapsed time: 23584.226 msecs"
19998
user> (time (count (gen-ngrams 3 (take 30000 corpus))))
"Elapsed time: 54905.999 msecs"
29998
user> (time (count (gen-ngrams 3 (take 40000 corpus))))
"Elapsed time: 100978.962 msecs"
39998

corpus is a Cons of string tokens.

What causes this behavior and how can I improve the performance?

Was it helpful?

Solution

I think your issue is with "(count coll)", which is iterating over the coll for each call to ngrams.

The solution would be to use the build in partition function:

user=> (time (count (gen-ngrams 3 (take 20000 corpus))))
"Elapsed time: 6212.894932 msecs"
19998
user=> (time (count (partition 3 1 (take 20000 corpus))))
"Elapsed time: 12.57996 msecs"
19998

Have a look in the partition source if curious about the implementation http://clojuredocs.org/clojure_core/clojure.core/partition

OTHER TIPS

I am far from a Clojure expert, but I think the cons function causes this problem. Try to use list instead:

(defn gen-ngrams
  [n coll]
  (if (>= (count coll) n)
    (lazy-seq (list (take n coll) (gen-ngrams n (rest coll))))))

I think cons construct a new seq which is more generic than a list, and therefore is slower.

Edit: and if "corpus is a Cons of string tokens", then try to make it a list...

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top