Question

I want to write my own naive Bayes classifier. I have a file like this:

(This is a database of spam and ham messages. The first word on each line is the label, spam or ham, and the rest of the line is the message. Size: 0.5 MB, from http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/)

ham     Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
ham     Ok lar... Joking wif u oni...
spam    Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
ham     U dun say so early hor... U c already then say...
ham     Nah I don't think he goes to usf, he lives around here though
spam    FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv

And I want to build a hash map like this: {"spam" {"go" 1, "until" 100, ...}, "ham" {......}}, where every value is a frequency map of words (separately for ham and spam).

I know how to do this in Python or C++, and I wrote it in Clojure, but my solution fails with a stack overflow on large data.

My solution:

(defn read_data_from_file [fname]
    (map #(split % #"\s")(map lower-case (with-open [rdr (reader fname)] 
        (doall (line-seq rdr))))))

(defn do-to-map [amap keyseq f]
    (reduce #(assoc %1 %2 (f (%1 %2))) amap keyseq))

(defn dicts_from_data [raw_data]
    (let [data (group-by #(first %) raw_data)]
        (do-to-map
            data (keys data) 
                (fn [x] (frequencies (reduce concat (map #(rest %) x)))))))

I tried to find where it fails and wrote this:

(def raw_data (read_data_from_file (first args)))
(def d (group-by #(first %) raw_data))
(def f (map frequencies raw_data))
(def d1 (reduce concat (d "spam")))
(println (reduce concat (d "ham")))

Error:

Exception in thread "main" java.lang.RuntimeException: java.lang.StackOverflowError
    at clojure.lang.Util.runtimeException(Util.java:165)
    at clojure.lang.Compiler.eval(Compiler.java:6476)
    at clojure.lang.Compiler.eval(Compiler.java:6455)
    at clojure.lang.Compiler.eval(Compiler.java:6431)
    at clojure.core$eval.invoke(core.clj:2795)
    at clojure.main$eval_opt.invoke(main.clj:296)
    at clojure.main$initialize.invoke(main.clj:315)
.....

Can anyone help me make this better and more efficient? PS: Sorry for my writing mistakes. English is not my native language.

Solution

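Using apply instead of reduce in the anonymous function avoids the StackOverflowError. (reduce concat ...) wraps one lazy concat inside another for every message, so realizing the result recurses once per nesting level and exhausts the stack on large input; (apply concat ...) builds a single lazy sequence over all the word lists. Instead of (fn [x] (frequencies (reduce concat (map #(rest %) x)))) use (fn [x] (frequencies (apply concat (map #(rest %) x)))).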
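
To see the difference in isolation, here is a minimal sketch. The names chunks, flat-reduce and flat-apply and the size 100000 are hypothetical, chosen only for illustration, and whether the reduce version actually overflows depends on the JVM stack size.

```clojure
;; Hypothetical data: 100000 one-element sequences, standing in for the
;; per-message word lists.
(def chunks (map list (range 100000)))

;; (reduce concat ...) nests one lazy concat inside another for every chunk:
;; (concat (concat (concat a b) c) d) ...
;; Nothing is realized yet, so defining this is safe.
(def flat-reduce (reduce concat chunks))

;; (apply concat ...) makes a single concat call over all chunks, so walking
;; the result does not recurse once per chunk.
(def flat-apply (apply concat chunks))

(count flat-apply)       ; => 100000
;; (count flat-reduce)   ; likely throws StackOverflowError with a default stack
```

The overflow only shows up when the nested sequence is realized, which is why it surfaced at frequencies or println rather than at the reduce call itself.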

The following is the same code, lightly refactored but with the exact same logic. read-data-from-file was changed to avoid mapping over the sequence of lines twice.

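;; Refer only the names that are needed: a bare (use 'clojure.string)
;; clashes with clojure.core/replace and clojure.core/reverse.
(use '[clojure.string :only [lower-case split]])
(use '[clojure.java.io :only [reader]])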

(defn read-data-from-file [fname]
  (let [lines (with-open [rdr (reader fname)] 
                (doall (line-seq rdr)))]
    (map #(-> % lower-case (split #"\s")) lines)))

(defn do-to-map [m keyseq f]
    (reduce #(assoc %1 %2 (f (%1 %2))) m keyseq))

(defn process-words [x]
  (->> x
    (map rest)
    (apply concat) ; This is the only real change from the
                   ; original code, it used to be (reduce concat).
    frequencies))

(defn dicts-from-data [raw_data]
  (let [data (group-by first raw_data)]
    (do-to-map data
               (keys data) 
               process-words)))

(-> "SMSSpamCollection.txt" read-data-from-file dicts-from-data keys)

OTHER TIPS

One other thing to consider is the use of (doall (line-seq ...)), which reads every line of the file into memory at once. This could cause problems if the file is very large. A handy trick for accumulating data like this is to use reduce. In your case, we need to reduce twice: once over the lines, and then over the words in each line. Something like this:

(require '[clojure.string :as str]
         '[clojure.java.io :as io])

(defn parse-line
  [line]
  (str/split (str/lower-case line) #"\s+"))

(defn build-word-freq
  [file]
  (with-open [rdr (io/reader file)]
    (reduce (fn [accum line]
              (let [[spam-or-ham & words] (parse-line line)]
                (reduce #(update-in %1 [spam-or-ham %2] (fnil inc 0)) accum words)))
            {}
            (line-seq rdr))))
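
The same double reduce can be tried at the REPL without a file. This is only a sketch: word-freq-from-lines is a hypothetical helper, not part of the answer above, and the three sample lines are made up.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper: the same fold as build-word-freq, but over any
;; sequence of lines instead of a file.
(defn word-freq-from-lines
  [lines]
  (reduce (fn [accum line]
            (let [[label & words] (str/split (str/lower-case line) #"\s+")]
              ;; (fnil inc 0) treats a missing count as 0 before incrementing.
              (reduce #(update-in %1 [label %2] (fnil inc 0)) accum words)))
          {}
          lines))

(word-freq-from-lines ["ham Ok lar ok" "spam Free entry" "ham ok"])
;; => {"ham" {"ok" 3, "lar" 1}, "spam" {"free" 1, "entry" 1}}
```

Because the accumulator is a single nested map, only the counts stay in memory while the lines are consumed one at a time.
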
Licensed under: CC-BY-SA with attribution