Non-linear slowdown creating a lazy seq in Clojure
-
08-06-2021 - |
Pregunta
I implemented a function that returns the n-grams of a given input collection as a lazy seq.
(defn gen-ngrams
[n coll]
(if (>= (count coll) n)
(lazy-seq (cons (take n coll) (gen-ngrams n (rest coll))))))
When I call this function with larger input collections, I would expect to see a linear increase in execution time. However, the timing I observe is worse than that:
user> (time (count (gen-ngrams 3 (take 1000 corpus))))
"Elapsed time: 59.426 msecs"
998
user> (time (count (gen-ngrams 3 (take 10000 corpus))))
"Elapsed time: 5863.971 msecs"
9998
user> (time (count (gen-ngrams 3 (take 20000 corpus))))
"Elapsed time: 23584.226 msecs"
19998
user> (time (count (gen-ngrams 3 (take 30000 corpus))))
"Elapsed time: 54905.999 msecs"
29998
user> (time (count (gen-ngrams 3 (take 40000 corpus))))
"Elapsed time: 100978.962 msecs"
39998
corpus
is a Cons
of string tokens.
What causes this behavior and how can I improve the performance?
Solución
I think your issue is with "(count coll)", which is iterating over the coll for each call to ngrams.
The solution would be to use the build in partition function:
user=> (time (count (gen-ngrams 3 (take 20000 corpus))))
"Elapsed time: 6212.894932 msecs"
19998
user=> (time (count (partition 3 1 (take 20000 corpus))))
"Elapsed time: 12.57996 msecs"
19998
Have a look in the partition source if curious about the implementation http://clojuredocs.org/clojure_core/clojure.core/partition
Otros consejos
I am far from a Clojure expert, but I think the cons function causes this problem. Try to use list instead:
(defn gen-ngrams
[n coll]
(if (>= (count coll) n)
(lazy-seq (list (take n coll) (gen-ngrams n (rest coll))))))
I think cons construct a new seq which is more generic than a list, and therefore is slower.
Edit: and if "corpus is a Cons of string tokens", then try to make it a list...