Orso Labs

Refactoring Clojure (1)

This article is based on Writing Friendlier Clojure by Adam Bard, where he shows his approach at refactoring some Clojure code that implements an order-1 word-level Markov text generator.

Our mission is to take this code and make it readable:

(defn markov-data [text]
  (let [maps
        (for [line (clojure.string/split text #"\.")
              m (let [l (str line ".")
                      words
                      (cons :start (clojure.string/split l #"\s+"))]
                  (for [p (partition 2 1 (remove #(= "" %) words))]
                    {(first p) [(second p)]}))]
          m)]
    (apply merge-with concat maps)))

(defn sentence [data]
  (loop [ws (data :start)
         acc []]
    (let [w (rand-nth ws)
          nws (data w)
          nacc (concat acc [w])]
      (if (= \. (last w))
        (clojure.string/join " " nacc)
        (recur nws nacc)))))

After playing a bit with it in the REPL to get a feeling of what is going on and of the shape of the data structure involved, we can start thinking how to refactor.

Remember that refactoring means changing the code without changing its behavior, so to know that we are not changing the behavior we need tests. Since the code exists already, we need characterization tests, well explained in the book Working Effectively with Legacy Code by Michael Feathers.

The following tests characterize the function markov-data:

(deftest markov-of-empty-string
  (is (= {:start ["."]} (markov/markov-data ""))))

(deftest markov-of-one-word-with-stop
  (is (= {:start ["A."]} (markov/markov-data "A."))))

(deftest markov-of-one-word-without-stop
  (is (= {:start ["A."]} (markov/markov-data "A"))))

(deftest markov-of-two-words
  (is (= {:start ["A"], "A" ["B."]} (markov/markov-data "A B."))))

(deftest markov-of-two-1-word-sentences
  (is (= {:start ["A." "B."]} (markov/markov-data "A. B."))))

(deftest markov-of-two-2-word-sentences
  (is (= {:start ["A" "C"], "A" ["B."], "C" ["D."]}
         (markov/markov-data "A B. C D."))))

(deftest markov-of-two-sentences-with-repetition
  (is (= {:start ["A" "A"], "A" ["B." "B"], "B" ["C."]}
         (markov/markov-data "A B. A B C"))))

(deftest markov-of-four-words-with-repetition
  (is (= {:start ["A"], "A" ["B" "B."], "B" ["A"]}
         (markov/markov-data "A B A B."))))

From the REPL and the tests we see that function markov-data takes a string and returns a hash map whose keys are the words in the string and whose values are sequences of the words just after the key, implementing an order-1 word-level Markov process. In addition the special key :start represents all the words that start a sentence.

Let’s look at the original code again:

(defn markov-data [text]
  (let [maps
        (for [line (clojure.string/split text #"\.")
              m (let [l (str line ".")
                      words
                      (cons :start (clojure.string/split l #"\s+"))]
                  (for [p (partition 2 1 (remove #(= "" %) words))]
                    {(first p) [(second p)]}))]
          m)]
    (apply merge-with concat maps)))

It seems easier to start from scratch as opposed to refactor it. The characterization tests above will help ensuring that the returned hash map stays the same.

We notice that the input string text is made of one or more sentences, where a sentence is a sequence of words terminated by a full stop character (.). As such, we can start by writing a function to process only one sentence.

We take the sentence, we split it in words and we consume one word after the other, updating the hash map as we go. This calls for a reduce. The trick is to realize that the function passed to reduce is not really limited to two arguments (first the accumulator and second the element of the collection), because the accumulator can be a vector and we can do destructuring:

(defn process-sentence [data sentence]
  (let [[data _]                                                                   ; <1>
        (reduce (fn [[data key] word]                                              ; <2>
                    [(update data key (fn [val] (if val (conj val word) [word])))  ; <3>
                     word])                                                        ; <4>
                [data :start]
                (string/split (string/trim sentence) #"\s+"))]
    data))                                                                         ; <5>

Now that we have a function able to process a single sentence, we can go back to markov-data and realize that we can use reduce again, this time passing to it process-sentence and splitting the input string text on the boundaries of the full stop character:

(defn markov-data [text]
  (let [data (->> (string/split text #"\.")
                  (filter (complement string/blank?))
                  (map #(str % "."))                        ; <1>
                  (reduce process-sentence {}))]
    (if (empty? data)
      {:start ["."]}                                        ; <2>
      data)))

Now it is the turn of function sentence, for which we could expect difficulties in testing, since it involves randomness.

The first 3 characterization tests are simple, no randomness involved:

(deftest sentence-of-empty-string-is-the-dot
  (is (= "." (markov/sentence (markov/markov-data "")))))

(deftest sentence-of-one-word-is-itself
  (is (= "A." (markov/sentence (markov/markov-data "A.")))))

(deftest sentence-of-one-sentence-is-itself
  (is (= "A B C." (markov/sentence (markov/markov-data "A B C.")))))

The next 2 tests are more complicated because they must consider the randomness of a Markov process:

(deftest sentence-of-non-repeating-words-is-one-of-the-original-sentences
  (let [out (markov/sentence (markov/markov-data "A B C D. E F G. H I."))]
    (is (some #{out} ["A B C D." "E F G." "H I."]))))

(deftest sentence-of-simple-repetition
  (let [out (markov/sentence (markov/markov-data "A B C. B D."))]
    (is (some #{out} ["B C." "A B D." "B D." "A B C."]))))

The trick here is to use a small input and enumerate all the possible outputs, using the construct

(some #{"A"} ["A" "B" "C"])

which is the Clojure idiomatic way to test if element "A" is contained in collection ["A" "B" "C"].

Now we can refactor. The original again:

(defn sentence [data]
  (loop [ws (data :start)
         acc []]
    (let [w (rand-nth ws)                       ; <1>
          nws (data w)
          nacc (concat acc [w])]
      (if (= \. (last w))                       ; <2>
        (clojure.string/join " " nacc)
        (recur nws nacc)))))

For our refactoring, we replace the loop with a recursive function call and we use the fact that a Clojure function can have different arities. The version with arity 1, respecting the API, calls the version with arity 3, the recursion point.

We also use slightly longer variable names to increase readability and the idiomatic way to add to a vector:

(defn sentence
  ([data]
   (sentence data (:start data) []))
  ([data words acc]
   (let [word     (rand-nth words)
         acc-next (conj acc word)]
     (if (string/ends-with? word ".")
       (string/join " " acc-next)
       (recur data (get data word) acc-next)))))

We are done. Although the refactored code is slightly longer than the original, it is readable and so maintainable.

References

#clojure #refactoring #markov-chain