clojure frequency dictionary from big data

Question 1

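Using apply instead of reduce in the anonymous function avoids the StackOverflowError. (reduce concat ...) wraps one lazy concat around another for every element, so realizing the result has to descend through thousands of nested lazy sequences and exhausts the stack; (apply concat ...) passes all the sequences to a single concat call, which stays flat. Instead of (fn [x] (frequencies (reduce concat (map #(rest %) x)))) use (fn [x] (frequencies (apply concat (map #(rest %) x)))).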
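
The difference can be reproduced at the REPL without the data file. This is only a minimal sketch: word-lists and overflows? are hypothetical names made up for the illustration, and whether the nested version actually overflows depends on the JVM stack size (with default settings it typically does).

```clojure
;; Hypothetical stand-in for the real data: 100,000 one-element lists.
(def word-lists (map list (range 100000)))

(defn overflows?
  "Returns true if calling thunk throws StackOverflowError, false otherwise."
  [thunk]
  (try
    (thunk)
    false
    (catch StackOverflowError _ true)))

;; Nested: every step wraps the previous lazy seq in another concat.
;; Building it is cheap because nothing is realized yet.
(def nested (reduce concat word-lists))

;; Flat: one concat call receives all the lists at once.
(def flat (apply concat word-lists))

;; Realizing even the first element of the nested version has to walk
;; down through every layer; with a default stack this typically overflows.
(println "reduce concat overflows:" (overflows? #(first nested)))

;; The flat version can be walked end to end.
(println "apply concat overflows: " (overflows? #(count flat)))
```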

The following is the same code, lightly refactored but with exactly the same logic. The read-data-from-file function was changed to avoid mapping over the sequence of lines twice.

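;; Refer only the names used below; (use 'clojure.string) would also pull in
;; replace and reverse, which clash with clojure.core.
(require '[clojure.string :refer [lower-case split]]
         '[clojure.java.io :refer [reader]])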

(defn read-data-from-file [fname]
  (let [lines (with-open [rdr (reader fname)] 
                (doall (line-seq rdr)))]
    (map #(-> % lower-case (split #"\s")) lines)))

(defn do-to-map [m keyseq f]
    (reduce #(assoc %1 %2 (f (%1 %2))) m keyseq))

(defn process-words [x]
  (->> x
    (map rest)
    (apply concat) ; This is the only real change from the
                   ; original code, it used to be (reduce concat).
    frequencies))

(defn dicts-from-data [raw-data]
  (let [data (group-by first raw-data)]
    (do-to-map data
               (keys data)
               process-words)))

(-> "SMSSpamCollection.txt" read-data-from-file dicts-from-data keys)
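
To see the shape of the result without the data file, here is a small sketch on made-up rows. sample-rows and sample-dicts are hypothetical names, and the sketch uses mapcat and into rather than do-to-map, which should give the same result for this input.

```clojure
;; Hypothetical rows in the shape read-data-from-file produces:
;; the label first, then the words of the message.
(def sample-rows
  [["ham" "hello" "world"]
   ["spam" "win" "cash"]
   ["ham" "hello"]])

;; Group the rows by label, then count the words under each label.
(def sample-dicts
  (into {}
        (for [[label rows] (group-by first sample-rows)]
          [label (frequencies (mapcat rest rows))])))

(println sample-dicts)
;; sample-dicts is equal to
;; {"ham" {"hello" 2, "world" 1}, "spam" {"win" 1, "cash" 1}}
```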

Question 2

One other thing to consider is the use of (doall (line-seq ...)), which reads every line of the file into memory at once. This could cause problems if the file is very large. A handy trick for accumulating data like this is to use reduce, which is eager: it consumes the lines one at a time while the reader is still open and keeps only the accumulated map. In your case, we need to reduce twice: once over the lines, and then over the words in each line. Something like this:

(require '[clojure.string :as str]
         '[clojure.java.io :as io])

(defn parse-line
  [line]
  (str/split (str/lower-case line) #"\s+"))

(defn build-word-freq
  [file]
  (with-open [rdr (io/reader file)]
    (reduce (fn [accum line]
              (let [[spam-or-ham & words] (parse-line line)]
                (reduce #(update-in %1 [spam-or-ham %2] (fnil inc 0)) accum words)))
            {}
            (line-seq rdr))))
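
The double reduce can be tried on in-memory lines before pointing it at a real file. This is a sketch under assumptions: sample-lines, add-line and sample-freqs are hypothetical names, and the lines are made up to mimic the label-then-text layout.

```clojure
(require '[clojure.string :as str])

;; Hypothetical sample lines: a label followed by the message text.
(def sample-lines
  ["ham Hello world"
   "spam WIN cash now"
   "ham hello again"])

(defn add-line
  "Folds one line into the nested {label {word count}} map."
  [accum line]
  (let [[label & words] (str/split (str/lower-case line) #"\s+")]
    ;; (fnil inc 0) treats a missing count (nil) as 0 before incrementing.
    (reduce #(update-in %1 [label %2] (fnil inc 0)) accum words)))

;; Outer reduce over the lines; the inner reduce lives in add-line.
(def sample-freqs (reduce add-line {} sample-lines))

(println sample-freqs)
;; sample-freqs is equal to
;; {"ham" {"hello" 2, "world" 1, "again" 1},
;;  "spam" {"win" 1, "cash" 1, "now" 1}}
```

Because update-in creates the inner map on first use, no separate group-by pass is needed, and only the counts are held in memory.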