Question

I'm trying to randomly sample a large FASTQ file and write it to standard out. I keep getting 'GC overhead limit exceeded' errors and I'm not sure what I'm doing wrong. I've tried increasing Xmx in Leiningen, but that didn't help. Here is my code:

(ns fastq-sample.core
  (:gen-class)
  (:use clojure.java.io))

(def n-read-pair-lines 8)

(defn sample? [sample-rate]
  (> sample-rate (rand)))

;
; Agent for writing the reads asynchronously
;

(def wtr (agent (writer *out*)))

(defn write-out [r]
  (letfn [(write [out msg] (.write out msg) out)]
    (send wtr write r)))

(defn write-close []
  (send wtr #(.close %))
  (await wtr))

;
; Main
;

(defn reads [file]
  (->>
    (input-stream file)
    (java.util.zip.GZIPInputStream.)
    (reader)
    (line-seq)))

(defn -main [fastq-file sample-rate-str]
  (let [sample-rate (Float. sample-rate-str)
        in-reads    (partition n-read-pair-lines (reads fastq-file))]
    (doseq [x (filter (fn [_] (sample? sample-rate)) in-reads)]
      (write-out (clojure.string/join "\n" x)))
    (write-close)
    (shutdown-agents)))

Solution

This is the same symptom I often see when I try to merge an infinite sequence into a single data structure like a map or vector. It very often means that memory was tight and the garbage collector could not keep up with the demand for new objects. Most likely the state accumulated through the wtr agent is too large for memory. You may want to stop keeping the writer's results in the agent by changing

(write [out msg] (.write out msg) out)

to

(write [out msg] (.write out msg))
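For comparison, the agent can also be dropped entirely and the sampled reads written synchronously while streaming. This is a sketch of that alternative, not the asker's original design: the function name `stream-sample` is illustrative, and it hardcodes the 8-line read-pair grouping from the question. Because the lazy sequence is consumed by `doseq` without the head being retained, at most one read group needs to be in memory at a time.

```clojure
;; Minimal synchronous alternative (illustrative sketch): stream the
;; gzipped FASTQ lazily and write each sampled read group immediately,
;; so no writer state or pending agent messages can accumulate.
(require '[clojure.java.io :as io]
         '[clojure.string :as str])

(defn stream-sample
  "Lazily read gzipped `file`, keep each group of 8 lines with
  probability `sample-rate`, and print it straight to *out*."
  [file sample-rate]
  (with-open [rdr (io/reader
                    (java.util.zip.GZIPInputStream.
                      (io/input-stream file)))]
    (doseq [group (partition 8 (line-seq rdr))
            :when (> sample-rate (rand))]
      (println (str/join "\n" group)))))
```

With `with-open` the reader is closed deterministically, and there is no need for `write-close` or `shutdown-agents` at the end of `-main`.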
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow