Clojure: Store and Compile Large Derived Data Structure

Question 1

Embedding data this big in compiled code may not be possible because of size limits imposed by the JVM. In particular, the bytecode of a single method may not exceed 64 KiB. Embedding data in the way I describe further below also means including a lot of constants and construction code in the class file the data is going to live in, which doesn't seem like a great idea.

Given that you're using the data structure read-only, you can construct it once and emit it to a .clj / .edn file (edn is the serialization format based on Clojure literal notation). Then put that file on your classpath as a "resource" (in resources/ with default Leiningen settings), so that it gets included in the überjar unless excluded by :uberjar-exclusions in project.clj, and read it from the resource at runtime at the full speed of Clojure's reader:

(ns foo.core
  (:require [clojure.java.io :as io]))

(defn get-the-huge-data-structure []
  (let [r (io/resource "huge.edn")]
    ;; with-open closes the reader once the data has been read
    (with-open [rdr (java.io.PushbackReader. (io/reader r))]
      (read rdr))))

;; if you then do something like this:

(def ds (get-the-huge-data-structure))

;; your app will load the data as soon as this namespace is required;
;; for your :main namespace, this means as soon as the app starts;
;; note that if you use AOT compilation, it'll also be loaded at
;; compile time

You could also not add it to the überjar, but rather add it to the classpath when running your app. This way your überjar itself would not have to be huge.
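The "construct it once, then emit it" step mentioned earlier could look roughly like the sketch below. The function names write-edn-file! and read-edn-file and the sample data are made up for illustration, and clojure.edn/read is swapped in for plain read because it never evaluates code; a temp file stands in for resources/huge.edn.

```clojure
(require '[clojure.java.io :as io]
         '[clojure.edn :as edn])

;; Hypothetical helper: print `data` in readable form to `path`.
(defn write-edn-file! [path data]
  (with-open [w (io/writer path)]
    (binding [*out* w
              ;; make sure nothing is truncated with "..." while printing
              *print-length* nil
              *print-level* nil]
      (pr data))))

;; Hypothetical helper: read a single EDN value back from `path`.
(defn read-edn-file [path]
  (with-open [rdr (java.io.PushbackReader. (io/reader path))]
    (edn/read rdr)))

;; Made-up sample data; a temp file stands in for resources/huge.edn.
(def sample {:words ["alpha" "beta"] :index {1 #{:a :b}}})
(def tmp (java.io.File/createTempFile "huge" ".edn"))

(write-edn-file! tmp sample)
(def round-tripped (read-edn-file tmp))
```

Binding *print-length* and *print-level* to nil matters mostly when emitting from a REPL session where those have been set, since a truncated file would not read back as the same data.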

Handling stuff other than persistent Clojure data could be accomplished using print-method (when serializing) and reader tags (when deserializing). Arthur already demonstrated using reader tags; to use print-method, you'd do something like

(defmethod print-method clojure.lang.Ref [x writer]
  (.write writer "#ref ")
  (print-method @x writer))

;; from the REPL, after doing the above:

user=> (pr-str {:foo (ref 1)})
"{:foo #ref 1}"

Of course you only need to have the print-method methods defined when serializing; your deserializing code can leave them alone, but will need appropriate data readers.
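To make the two halves concrete, here is a minimal round-trip sketch that pairs the print-method above with a matching data reader. It reuses the unqualified #ref tag from the example; unqualified tags are reserved for Clojure itself, so a namespaced tag would be the safer choice in real code.

```clojure
;; Serializing side: print a Ref as "#ref <current value>".
(defmethod print-method clojure.lang.Ref [x ^java.io.Writer writer]
  (.write writer "#ref ")
  (print-method @x writer))

(def printed (pr-str {:foo (ref 1)}))
;; printed is "{:foo #ref 1}"

;; Deserializing side: map the tag to a function that rebuilds the Ref.
(def restored
  (binding [*data-readers* {'ref ref}]
    (read-string printed)))

;; restored is a map whose :foo value is a fresh Ref holding 1
(println @(:foo restored))
```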

Disregarding the code size issue for a moment, as I find the data embedding issue interesting:

Assuming your data structure only contains immutable data natively handled by Clojure (Clojure persistent collections, arbitrarily nested, plus atomic items such as numbers, strings (atomic for this purpose), keywords, symbols; no Refs etc.), you can indeed include it in your code:

(defmacro embed [x]
  x)

The generated bytecode will then recreate x without reading anything, by using constants included in the class file and static methods of the clojure.lang.RT class (e.g. RT.vector and RT.map).

This is, of course, how literals are compiled, since the macro above is a noop. We can make things more interesting though:

(ns embed-test.core
  (:require [clojure.java.io :as io])
  (:gen-class))

(defmacro embed-resource [r]
  (let [r (io/resource r)]
    ;; runs at macro-expansion time; the value read becomes the expansion
    (with-open [rdr (java.io.PushbackReader. (io/reader r))]
      (read rdr))))

(defn -main [& args]
  (println (embed-resource "foo.edn")))

This will read foo.edn at compile time and embed the result in the compiled code (in the sense of including appropriate constants and code to reconstruct the data in the class file). At run time, no further reading will be performed.
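One caveat worth sketching: whatever a macro returns is compiled as code, so symbols and lists inside the data would be evaluated rather than embedded as-is. A quoting variant avoids that. The macro name embed-edn-string below is hypothetical, and it parses a string literal with clojure.edn instead of reading a resource, purely to keep the example self-contained.

```clojure
(require '[clojure.edn :as edn])

;; Hypothetical variant: parse EDN at macro-expansion time and return the
;; value wrapped in (quote ...), so symbols and lists in the data are kept
;; as data instead of being compiled as code.
(defmacro embed-edn-string [s]
  (list 'quote (edn/read-string s)))

;; The string is parsed while this form is compiled, not when it runs.
(def embedded
  (embed-edn-string "{:op (inc x) :items [1 2 3]}"))

;; (:op embedded) is the list (inc x); x is never resolved
(println (:op embedded))
```

The same (list 'quote ...) wrapper could be applied to the resource-reading macro if the data may contain symbols or lists.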

Question 2

Is this structure something that doesn't change? If it doesn't, consider using Java serialization to persist the structure. Deserializing will be much faster than rebuilding it every time.
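A minimal sketch of that approach, assuming the structure consists only of values whose classes implement java.io.Serializable (Clojure's persistent collections, keywords, strings and numbers do). The helper names save-serialized! and load-serialized and the sample tree are made up for illustration.

```clojure
(require '[clojure.java.io :as io])
(import '(java.io ObjectOutputStream ObjectInputStream File))

;; Hypothetical helper: write `x` to `path` using Java serialization.
(defn save-serialized! [path x]
  (with-open [out (ObjectOutputStream. (io/output-stream path))]
    (.writeObject out x)))

;; Hypothetical helper: read back whatever object was stored at `path`.
(defn load-serialized [path]
  (with-open [in (ObjectInputStream. (io/input-stream path))]
    (.readObject in)))

;; Made-up sample structure, round-tripped through a temp file.
(def tree {:root [1 2 {:leaf #{"a" "b"}}]})
(def tmp-file (File/createTempFile "tree" ".ser"))

(save-serialized! tmp-file tree)
(def loaded (load-serialized tmp-file))
```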

Question 3

If you can structure the tree to be a single value instead of a tree of references to many values, then you would be able to print the tree and read it back. Because refs are not readable, you won't be able to treat the entire tree as something readable without doing your own parsing.

It may be worth looking into using the extensible reader to add print and read functions for your tree by making it a type.

Here is a minimal example of using data readers to produce references to sets and maps from a string.

First, define handlers for the contents of each EDN tag/type:

user> (defn parse-map-ref [m] (ref (apply hash-map m)))
#'user/parse-map-ref
user> (defn parse-set-ref [s] (ref (set s)))
#'user/parse-set-ref

Then bind *data-readers* to a map that associates the textual tags with the handlers:

(def y-as-string 
   "#user/map-ref [:zebra #user/set-ref [1 2 3 4]]")

user> (def y (binding [*data-readers* {'user/set-ref user/parse-set-ref
                                       'user/map-ref user/parse-map-ref}]
              (read-string y-as-string)))

user> y
#<Ref@6d130699: {:zebra #<Ref@7c165ec0: #{1 2 3 4}>}>

This also works with more deeply nested trees:

(def z-as-string
  "#user/map-ref [:zebra #user/set-ref [1 2 3 4]
                  :ox #user/map-ref [:animal #user/set-ref [42]]]")

user> (def z (binding [*data-readers* {'user/set-ref user/parse-set-ref
                                       'user/map-ref user/parse-map-ref}]
               (read-string z-as-string)))
#'user/z
user> z
#<Ref@2430c1a0: {:ox #<Ref@7cf801ef: {:animal #<Ref@7e473201: #{42}>}>,
                 :zebra #<Ref@7424206b: #{1 2 3 4}>}>

Producing strings from trees can be accomplished by extending the print-method multimethod, though it would be a lot easier if you defined types for ref-map and ref-set using deftype, so the printer can tell which ref should produce which tag.
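A rough sketch of that idea follows. The wrapper types RefMap and RefSet and the readers map are hypothetical names, not part of any library; each type holds one ref, and the readers rebuild the wrappers instead of returning bare refs as the handlers above do.

```clojure
;; Hypothetical wrapper types, each holding a single ref in field r.
(deftype RefMap [r])
(deftype RefSet [r])

(defmethod print-method RefMap [^RefMap x ^java.io.Writer w]
  (.write w "#user/map-ref ")
  ;; Flatten {k v ...} into [k v ...], the shape the map reader expects.
  (print-method (vec (mapcat identity @(.-r x))) w))

(defmethod print-method RefSet [^RefSet x ^java.io.Writer w]
  (.write w "#user/set-ref ")
  (print-method (vec @(.-r x)) w))

(def tree (RefMap. (ref {:zebra (RefSet. (ref #{1}))})))
(def printed (pr-str tree))
;; printed is "#user/map-ref [:zebra #user/set-ref [1]]"

;; Readers that rebuild the wrapper types from the tagged vectors.
(def readers
  {'user/map-ref (fn [kvs] (RefMap. (ref (apply hash-map kvs))))
   'user/set-ref (fn [xs] (RefSet. (ref (set xs))))})

(def restored
  (binding [*data-readers* readers]
    (read-string printed)))
```

Because print-method dispatches on the class of its argument, each wrapper type gets its own tag without any inspection of what the ref currently holds.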

If reading them as strings turns out to be too slow in general, there are faster binary serialization libraries, such as protocol buffers.