How to create a configurable reliability layer with Photon.NET

Note: this article is based on the Photon.NET codebase at tag 0.0.28.0 (follow the project link for details on acquiring that specific version).

A reliability layer represents a simple concept: we want to retry an unreliable method call (one that throws exceptions or returns unexpected results) until it succeeds or some other condition is met.

In regular code, this could be achieved by writing a snippet like this for every single call we want to protect:

var failedCounter = 0;
while (true)
{
  try
  {
    _provider.SaveModel(model);
    break; // success: stop retrying
  }
  catch (DbException ex)
  {
    failedCounter += 1;
    var id = _exceptionCounter.Add(ex); // report to the counter, get an ID back
    if (failedCounter >= 5)
    {
      // fifth failure: give up and rethrow as a wrapped exception
      _log.FatalFormat("We are giving up on error {0}...", id);
      throw new GenericRepositoryException("Repository exception, ID=" + id);
    }
    _log.InfoFormat("Non-fatal exception. Retrying in 1 sec...");
    SystemUtil.Sleep(1.Seconds()); // back off before the next attempt
  }
}

I didn’t even test this code, since it is too boring, hard-coded and inefficient.
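To look at the retry rules apart from any particular call, here is a minimal sketch of the same loop as a reusable helper. `Retry.Execute` is a hypothetical name, not part of Photon.NET: the protected call is passed in as a delegate, the failure limit and the delay become parameters, and the last exception is rethrown as-is instead of being wrapped and logged. The rules still have to be spelled out at every call site, which is what the exception policy approach avoids.

```csharp
using System;
using System.Threading;

// Hypothetical helper (not part of Photon.NET): the loop above with the
// protected call passed as a delegate and the constants as parameters.
public static class Retry
{
    public static void Execute<TException>(Action action, int maxAttempts, TimeSpan delay)
        where TException : Exception
    {
        if (maxAttempts < 1)
            throw new ArgumentOutOfRangeException(nameof(maxAttempts));

        var failedCounter = 0;
        while (true)
        {
            try
            {
                action();
                return; // success: stop retrying
            }
            catch (TException)
            {
                failedCounter += 1;
                if (failedCounter >= maxAttempts)
                    throw; // give up: rethrow the last exception unchanged
                Thread.Sleep(delay); // back off before the next attempt
            }
        }
    }
}
```

For instance, `Retry.Execute<DbException>(() => _provider.SaveModel(model), 5, TimeSpan.FromSeconds(1))` would mirror the loop above, minus the logging, the exception counter and the wrapping.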

We could achieve a similar effect by encapsulating the retry rules inside an exception policy. Here’s what the IExceptionHandler implementation looks like:

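using System;
using System.Data.Common; // DbException
// IExceptionHandler, IExceptionCounter, INamedProvider<ILog>, SystemUtil and
// GenericRepositoryException come from the Photon.NET codebase and its dependencies

/// <summary>
/// Note that the values are hard-coded, but they could
/// easily be made configurable at the container level
/// </summary>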
internal sealed class ReliableRepositoryHandler : IExceptionHandler
{
  private readonly IExceptionCounter _counter;
  private readonly INamedProvider<ILog> _provider;
  private int _failedCounter;

  public ReliableRepositoryHandler(IExceptionCounter counter,
    INamedProvider<ILog> provider)
  {
    _counter = counter;
    _provider = provider;
  }

  public void Dispose()
  {
    if (_failedCounter != 0)
    {
      var log = _provider.CreateLog<ReliableRepositoryHandler>();
      log.Debug("We have recovered from the exception");
    }
  }

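  // Called for each exception thrown by the protected call:
  // returning true means "handled, retry"; returning false lets the
  // exception propagate; throwing gives up with a wrapped exception
  public bool CanHandle(Exception ex)
  {
    if (ex is DbException)
    {
      _failedCounter += 1;
      var id = _counter.Add(ex);
      var log = _provider.CreateLog<ReliableRepositoryHandler>();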

      if (_failedCounter >= 5)
      {
        log.FatalFormat("Exception {0} is wrapped and rethrown", id);
        throw new GenericRepositoryException("Repository exception, ID=" + id);
      }

      log.InfoFormat("Non-fatal exception {0}, Retrying in 1 sec...", id);
      SystemUtil.Sleep(1.Seconds());
      return true;
    }

    return false;
  }
}

Then we just need to register its policy (IExceptionPolicy is responsible for creating instances of IExceptionHandler) in the container under some name (e.g. “RepositoryReliability”) and use the magic keyword retry (actually a really simple Boo macro) in the interception syntax from the last post:

// intercept another method
def SaveModel(model as Model):
  _log.Debug("Saving model...")
  retry _policy.Get("RepositoryReliability"):
    _inner.SaveModel(model)

That’s it. Now, when you run Photon.ModelRunner, you will see that IModelProvider.SaveModel calls, as defined by the policy, no longer fail when the underlying component throws a DbException from time to time. The handler sleeps 1 sec between failures and retries up to four times; the fifth failure within a single call is wrapped in a GenericRepositoryException and rethrown. Every exception is also reported to the counter (we’ll get to this later).
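The article doesn’t show what the retry macro expands to. Below is a minimal C# sketch of one plausible expansion, assuming a fresh handler is created for each protected call (handlers keep their own failure count) and disposed when the call completes. The body is re-run while CanHandle returns true, the original exception is rethrown when it returns false, and anything CanHandle throws itself (such as the wrapped GenericRepositoryException) propagates. `RetryBlock.Run`, the trimmed-down `IExceptionHandler` and the toy `RetryTimesHandler` are hypothetical stand-ins, not Photon.NET’s actual types.

```csharp
using System;

// Trimmed-down stand-in for Photon.NET's IExceptionHandler (assumed shape).
public interface IExceptionHandler : IDisposable
{
    bool CanHandle(Exception ex);
}

public static class RetryBlock
{
    // Hypothetical C# equivalent of `retry <policy>: <body>`;
    // createHandler plays the role of the policy.
    public static void Run(Func<IExceptionHandler> createHandler, Action body)
    {
        // One handler per protected call, since handlers keep per-call state;
        // `using` disposes it whether the body eventually succeeds or not.
        using (var handler = createHandler())
        {
            while (true)
            {
                try
                {
                    body();
                    return; // the body succeeded: stop retrying
                }
                catch (Exception ex)
                {
                    // true: handled, run the body again
                    // false: not ours, rethrow the original exception
                    // (if CanHandle throws, that exception propagates instead)
                    if (!handler.CanHandle(ex))
                        throw;
                }
            }
        }
    }
}

// Toy handler for experiments: retries TException at most maxRetries times.
public sealed class RetryTimesHandler<TException> : IExceptionHandler
    where TException : Exception
{
    private readonly int _maxRetries;

    public RetryTimesHandler(int maxRetries)
    {
        _maxRetries = maxRetries;
    }

    public int Retries { get; private set; }
    public bool Disposed { get; private set; }

    public bool CanHandle(Exception ex)
    {
        if (!(ex is TException) || Retries >= _maxRetries)
            return false;
        Retries += 1;
        return true;
    }

    public void Dispose()
    {
        Disposed = true;
    }
}
```

Keeping the loop in one place and the decision logic in the handler is what lets the same two-word retry clause protect any interceptable call.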

All this reliability logic is neatly encapsulated in the ReliableRepositoryHandler class and can be reused to make any interceptable call reliable with just a couple of words.

As you may have noticed, this sample policy contains some hard-coded constants, such as “1.Seconds()” and the limit of 5 failures. It is natural to make them configurable through the IoC infrastructure, from an external file. But the XML configuration syntax for Autofac is too noisy and inflexible (as any XML syntax would be)… As always, stay tuned.
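In the meantime, here is a minimal sketch of how the constants could be lifted out of the handler, under stated assumptions. A hypothetical `RetrySettings` object carries the failure limit and the delay (in a real setup the container would populate it from an external file), and a hypothetical policy hands out a fresh handler per call that shares those settings. The interface is a trimmed-down stand-in for Photon.NET’s IExceptionHandler, and InvalidOperationException stands in for GenericRepositoryException; this is not the configuration approach the next post will describe.

```csharp
using System;
using System.Threading;

// Hypothetical settings; the defaults match the hard-coded values above.
public sealed class RetrySettings
{
    public int MaxFailures { get; set; } = 5;
    public TimeSpan Delay { get; set; } = TimeSpan.FromSeconds(1);
}

// Trimmed-down stand-in for Photon.NET's IExceptionHandler (assumed shape).
public interface IExceptionHandler : IDisposable
{
    bool CanHandle(Exception ex);
}

// Same decision logic as ReliableRepositoryHandler, driven by settings
// instead of constants (logging and the exception counter omitted).
public sealed class ConfigurableRetryHandler<TException> : IExceptionHandler
    where TException : Exception
{
    private readonly RetrySettings _settings;
    private int _failedCounter;

    public ConfigurableRetryHandler(RetrySettings settings)
    {
        if (settings == null) throw new ArgumentNullException(nameof(settings));
        _settings = settings;
    }

    public bool CanHandle(Exception ex)
    {
        if (!(ex is TException))
            return false; // not ours: let it propagate

        _failedCounter += 1;
        if (_failedCounter >= _settings.MaxFailures)
            throw new InvalidOperationException(
                "Giving up after " + _failedCounter + " failures", ex);

        Thread.Sleep(_settings.Delay); // back off, then ask for a retry
        return true;
    }

    public void Dispose() { }
}

// Hypothetical policy: shared settings, one fresh handler per protected call.
public sealed class ConfigurableRetryPolicy<TException> where TException : Exception
{
    private readonly RetrySettings _settings;

    public ConfigurableRetryPolicy(RetrySettings settings)
    {
        _settings = settings;
    }

    public IExceptionHandler CreateHandler() =>
        new ConfigurableRetryHandler<TException>(_settings);
}
```

Keeping the settings in the policy rather than in each handler means a configuration change affects every subsequent call, while each call still starts counting failures from zero.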