Refactoring Clojure (1)

This article is based on Writing Friendlier Clojure by Adam Bard, where he shows his approach at refactoring some Clojure code that implements an order-1 word-level Markov text generator.

Our mission is to take this code and make it readable:

(defn markov-data [text]
  (let [maps
        (for [line (clojure.string/split text #"\.")
              m (let [l (str line ".")
                      words
                      (cons :start (clojure.string/split l #"\s+"))]
                  (for [p (partition 2 1 (remove #(= "" %) words))]
                    {(first p) [(second p)]}))]
          m)]
    (apply merge-with concat maps)))

(defn sentence [data]
  (loop [ws (data :start)
         acc []]
    (let [w (rand-nth ws)
          nws (data w)
          nacc (concat acc [w])]
      (if (= \. (last w))
        (clojure.string/join " " nacc)
        (recur nws nacc)))))

After playing a bit with it in the REPL to get a feeling of what is going on and of the shape of the data structure involved, we can start thinking how to refactor.

Remember that refactoring means changing the code without changing its behavior, so to know that we are not changing the behavior we need tests. Since the code exists already, we need characterization tests, well explained in the book Working Effectively with Legacy Code by Michael Feathers.

The following tests characterize the function markov-data:

(deftest markov-of-empty-string
  (is (= {:start ["."]} (markov/markov-data ""))))

(deftest markov-of-one-word-with-stop
  (is (= {:start ["A."]} (markov/markov-data "A."))))

(deftest markov-of-one-word-without-stop
  (is (= {:start ["A."]} (markov/markov-data "A"))))

(deftest markov-of-two-words
  (is (= {:start ["A"], "A" ["B."]} (markov/markov-data "A B."))))

(deftest markov-of-two-1-word-sentences
  (is (= {:start ["A." "B."]} (markov/markov-data "A. B."))))

(deftest markov-of-two-2-word-sentences
  (is (= {:start ["A" "C"], "A" ["B."], "C" ["D."]}
         (markov/markov-data "A B. C D."))))

(deftest markov-of-two-sentences-with-repetition
  (is (= {:start ["A" "A"], "A" ["B." "B"], "B" ["C."]}
         (markov/markov-data "A B. A B C"))))

(deftest markov-of-four-words-with-repetition
  (is (= {:start ["A"], "A" ["B" "B."], "B" ["A"]}
         (markov/markov-data "A B A B."))))

From the REPL and the tests we see that function markov-data takes a string and returns a hash map whose keys are the words in the string and whose values are sequences of the words just after the key, implementing an order-1 word-level Markov process. In addition the special key :start represents all the words that start a sentence.

Let’s look at the original code again:

(defn markov-data [text]
  (let [maps
        (for [line (clojure.string/split text #"\.")
              m (let [l (str line ".")
                      words
                      (cons :start (clojure.string/split l #"\s+"))]
                  (for [p (partition 2 1 (remove #(= "" %) words))]
                    {(first p) [(second p)]}))]
          m)]
    (apply merge-with concat maps)))

It seems easier to start from scratch as opposed to refactor it. The characterization tests above will help ensuring that the returned hash map stays the same.

We notice that the input string text is made of one or more sentences, where a sentence is a sequence of words terminated by a full stop character (.). As such, we can start by writing a function to process only one sentence.

We take the sentence, we split it in words and we consume one word after the other, updating the hash map as we go. This calls for a reduce. The trick is to realize that the function passed to reduce is not really limited to two arguments (first the accumulator and second the element of the collection), because the accumulator can be a vector and we can do destructuring:

(defn process-sentence [data sentence]
  (let [[data _]                                                                   ; <1>
        (reduce (fn [[data key] word]                                              ; <2>
                    [(update data key (fn [val] (if val (conj val word) [word])))  ; <3>
                     word])                                                        ; <4>
                [data :start]
                (string/split (string/trim sentence) #"\s+"))]
    data))                                                                         ; <5>

<2> The function passed to reduce takes as arguments [[data key] word], where the vector [data key] contains both the real accumulator (the hash map data) and the word key.
<3> Everything is done here, with update. The if is needed to differentiate when the hash map doesn’t have key key (deduced because val is nil).
<3>,<4> Construct the return value of the function, which is the vector [updated-data word], where updated-data is the updated hash map and word is the element of the collection, that will be used on the next call of the function as the hash map key.
<1> The let is needed to deconstruct the last return of the function, extract the hash map and return only it at <5>.

Now that we have a function able to process a single sentence, we can go back to markov-data and realize that we can use reduce again, this time passing to it process-sentence and splitting the input string text on the boundaries of the full stop character:

(defn markov-data [text]
  (let [data (->> (string/split text #"\.")
                  (filter (complement string/blank?))
                  (map #(str % "."))                        ; <1>
                  (reduce process-sentence {}))]
    (if (empty? data)
      {:start ["."]}                                        ; <2>
      data)))

<1> We add a full stop character to each sentence. This is visible also in the characterization tests. The reason is because the full stop character is the termination condition for the other function sentence, as we will see.
<2> Again we make sure that the hash map contains at least one full stop character.

Now it is the turn of function sentence, for which we could expect difficulties in testing, since it involves randomness.

The first 3 characterization tests are simple, no randomness involved:

(deftest sentence-of-empty-string-is-the-dot
  (is (= "." (markov/sentence (markov/markov-data "")))))

(deftest sentence-of-one-word-is-itself
  (is (= "A." (markov/sentence (markov/markov-data "A.")))))

(deftest sentence-of-one-sentence-is-itself
  (is (= "A B C." (markov/sentence (markov/markov-data "A B C.")))))

The next 2 tests are more complicated because they must consider the randomness of a Markov process:

(deftest sentence-of-non-repeating-words-is-one-of-the-original-sentences
  (let [out (markov/sentence (markov/markov-data "A B C D. E F G. H I."))]
    (is (some #{out} ["A B C D." "E F G." "H I."]))))

(deftest sentence-of-simple-repetition
  (let [out (markov/sentence (markov/markov-data "A B C. B D."))]
    (is (some #{out} ["B C." "A B D." "B D." "A B C."]))))

The trick here is to use a small input and enumerate all the possible outputs, using the construct

(some #{"A"} ["A" "B" "C"])

which is the Clojure idiomatic way to test if element "A" is contained in collection ["A" "B" "C"].

Now we can refactor. The original again:

(defn sentence [data]
  (loop [ws (data :start)
         acc []]
    (let [w (rand-nth ws)                       ; <1>
          nws (data w)
          nacc (concat acc [w])]
      (if (= \. (last w))                       ; <2>
        (clojure.string/join " " nacc)
        (recur nws nacc)))))

<1> Introduces the randomness of the Markov process.
<2> The termination condition.

For our refactoring, we replace the loop with a recursive function call and we use the fact that a Clojure function can have different arities. The version with arity 1, respecting the API, calls the version with arity 3, the recursion point.

We also use slightly longer variable names to increase readability and the idiomatic way to add to a vector:

(defn sentence
  ([data]
   (sentence data (:start data) []))
  ([data words acc]
   (let [word     (rand-nth words)
         acc-next (conj acc word)]
     (if (string/ends-with? word ".")
       (string/join " " acc-next)
       (recur data (get data word) acc-next)))))

We are done. Although the refactored code is slightly longer than the original, it is readable and so maintainable.

References

#clojure #refactoring #markov-chain