Try Three Times in Clojure

Summary: Distributed systems fail in indistinguishable ways. Often, retrying is a good solution to intermittent errors. We create a retry macro to handle the retries in a generic way.

Let's face it: your system is probably a distributed system. All web apps are by definition distributed. They have at least one server, probably a separate database server, and many browser clients. And now microservices are getting popular. Distributed is the current and future normal. While Clojure solves the problems of multiple cores sharing memory at the language level, distributed systems problems are left to be addressed at the application level.

The problem

One big problem that comes up all the time in distributed systems is dealing with failure. Failure happens everywhere. The problem in a distributed system is that you don't know where the failure happened. For example, let's say you make an HTTP GET request and 20 seconds later, you're still waiting for the response. Is it:

  • A network failure?
    • Did the message not get to the server?
    • Did the message get there, but the response didn't make it back?
  • The server is down?
  • The server is still working?
  • The response is still coming?
  • An intermediate computer (proxy) has filtered the request/response?

It is literally impossible to know what the problem is. And that's ok. There's a lot of machinery between one machine and the next. Even if you could diagnose the problem, are you really going to program each error case?

Metaphor

Let's say you call your friend and they don't pick up. Are they asleep? Is their phone off? Did the call not go through? The phone won't tell you. And you really want to talk to them. So what do you do? You call back. You might even call back a couple of times. If they pick up when you call back, great! If not, then you get tired and give up.

That's a common approach in distributed systems as well: retry your distributed message a few times before you give up. It's easy and fixes a surprising number of problems. What's more, there's a good solution that's simple in the Hickeyan sense.

The solution

Failure in Clojure typically means an Exception. So we'll need to catch exceptions and run code multiple times.

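    (defn try-n-times
      "Calls f, retrying up to n more times if it throws."
      [f n]
      (if (zero? n)
        ;; No retries left: call f and let any exception propagate.
        (f)
        (try
          (f)
          ;; Catch everything (Errors included) and retry with one fewer retry left.
          (catch Throwable _
            (try-n-times f (dec n))))))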

You pass it a function and a number of times to retry it. The base case is when n is 0. In that case, it will just try it (not retry). If it's greater than 0, it will wrap the function call in a try/catch, catch everything, and recurse. If it still throws an exception after n retries, that last call is not wrapped, so the exception propagates out of try-n-times and some other code will have to deal with it. The concern of retrying is separated from what is being retried.
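To see the retry behavior without a network, here is a small sketch. The `flaky` function and the `attempts` counter are made up for illustration: `flaky` throws on its first two calls and succeeds on the third.

```clojure
;; try-n-times as defined in the article, repeated so the snippet runs on its own.
(defn try-n-times [f n]
  (if (zero? n)
    (f)
    (try
      (f)
      (catch Throwable _
        (try-n-times f (dec n))))))

;; Hypothetical flaky operation: counts its calls and fails the first two times.
(def attempts (atom 0))

(defn flaky []
  (let [attempt (swap! attempts inc)]
    (if (< attempt 3)
      (throw (ex-info "transient failure" {:attempt attempt}))
      :ok)))

;; Two retries are just enough to get past the two failures.
(def result (try-n-times flaky 2))
;; result is :ok and @attempts is 3
```

The caller only ever sees `:ok`. The two transient failures are swallowed by the retries.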

How do you use it?

Wrap your distributed calls in this bad boy and you're good to go.

Instead of this:

    (http/get "http://somewhat-reliable.com/resource"
              {:socket-timeout 1000
               :conn-timeout   1000})

You do this:

    (try-n-times #(http/get "http://somewhat-reliable.com/resource"
                            {:socket-timeout 1000
                             :conn-timeout 1000}) 2)

Remember, n is the number of retries. So that's 1 try + 2 retries.
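When the problem is not temporary, the retries run out and the last exception reaches the caller. A minimal sketch of that case, using a made-up `always-fails` function in place of a real HTTP call:

```clojure
;; try-n-times as defined in the article, repeated so the snippet runs on its own.
(defn try-n-times [f n]
  (if (zero? n)
    (f)
    (try
      (f)
      (catch Throwable _
        (try-n-times f (dec n))))))

;; Hypothetical stand-in for a call to a server that stays down.
(def calls (atom 0))

(defn always-fails []
  (swap! calls inc)
  (throw (ex-info "still down" {})))

;; The calling code decides what giving up means. Here it returns a map.
(def outcome
  (try
    (try-n-times always-fails 2)
    (catch clojure.lang.ExceptionInfo e
      {:gave-up true :message (.getMessage e)})))
;; @calls is 3: 1 try + 2 retries
;; outcome is {:gave-up true, :message "still down"}
```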

Macro, anyone?

Alright, yes, I made a macro for that. It does come in handy to have a macro that you can put code in instead of passing in a function.

    (defmacro try3 [& body]
      `(try-n-times (fn [] ~@body) 2))

This one is used like this:

    (try3
      (println "trying!")
      (do-some-stuff))
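If you're curious what the macro turns your code into, `macroexpand-1` will show you. This sketch repeats both definitions so it runs on its own. The symbols in the expansion come out namespace-qualified, so the exact printed form depends on the namespace you define the macro in.

```clojure
;; Definitions from the article, repeated so the snippet runs on its own.
(defn try-n-times [f n]
  (if (zero? n)
    (f)
    (try
      (f)
      (catch Throwable _
        (try-n-times f (dec n))))))

(defmacro try3 [& body]
  `(try-n-times (fn [] ~@body) 2))

;; The body is wrapped in a zero-argument function and handed to try-n-times
;; with 2 retries. The shape is roughly:
;;   (your-ns/try-n-times (clojure.core/fn [] (println "trying!") (do-some-stuff)) 2)
(def expansion
  (macroexpand-1 '(try3 (println "trying!") (do-some-stuff))))

;; Like a function body, try3 returns the value of its last expression.
(def three (try3 (+ 1 2)))
```

Because the body ends up inside a `fn`, it is evaluated again from the top on every retry, side effects included.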

Warning

Now, a little care needs to be taken when you use this. Remember, when you get a failure, it could be a timeout. The server could still be processing your request. Or it could have failed halfway through a multi-step process. What that means practically is that your distributed message has to be idempotent: sending it two or three times must leave things in the same state as sending it once. HTTP GET is idempotent, so it's ok. POST generally is not, but sometimes it is. Use your judgment! Also, you should make your call time out, to turn long waits into errors.
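The HTTP example gets its timeouts from the client's options. For a call that has no timeout option of its own, one possible approach is to run it in a `future` and stop waiting after a deadline. This is only a sketch: `call-with-timeout` is a hypothetical helper, not part of any library, and cancelling the future merely interrupts the local thread. The remote side may still finish the work, which is one more reason the call has to be idempotent.

```clojure
;; try-n-times as defined in the article, repeated so the snippet runs on its own.
(defn try-n-times [f n]
  (if (zero? n)
    (f)
    (try
      (f)
      (catch Throwable _
        (try-n-times f (dec n))))))

(defn call-with-timeout
  "Hypothetical helper: runs f on another thread and throws a
  TimeoutException if no result arrives within timeout-ms."
  [f timeout-ms]
  (let [fut    (future (f))
        ;; deref with a timeout returns the sentinel instead of blocking forever.
        result (deref fut timeout-ms ::timed-out)]
    (if (= result ::timed-out)
      (do (future-cancel fut) ; interrupts the worker thread, best effort
          (throw (java.util.concurrent.TimeoutException.
                   (str "No result after " timeout-ms " ms"))))
      result)))

;; A long wait is now an exception, so try-n-times can retry it.
(def answer
  (try-n-times #(call-with-timeout (fn [] 42) 1000) 2))
;; answer is 42
```

If `f` itself throws, `deref` rethrows the failure wrapped in an `ExecutionException`, which `try-n-times` catches like anything else.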

Conclusion

This pattern is just one piece of a larger distributed system puzzle. The network and servers are unreliable. They might work the whole time during development, but in the fullness of time, an always-on distributed system will have some kind of failure. Sometimes the failures are temporary, and in those cases, a quick retry can fix them right away.

Though Clojure does not have specific solutions to distributed systems problems, coding them up is short and straightforward.