Fault Tolerance
As explained in Actor Systems, each actor is the supervisor of its children, and as such each actor defines a fault handling supervisor strategy. This strategy cannot be changed afterwards, as it is an integral part of the actor system’s structure.
§Fault Handling in Practice
First, let us look at a sample that illustrates one way to handle data store errors, which are a typical source of failure in real-world applications. What can be done when the data store is unavailable depends on the actual application, but in this sample we use a best-effort reconnect approach.
Read the following source code. The inlined comments explain the different pieces of the fault handling and why they are added. It is also highly recommended to run this sample, as the log output makes it easy to follow what is happening at runtime.
§Creating a Supervisor Strategy
The following sections explain the fault handling mechanism and alternatives in more depth.
For the sake of demonstration let us consider the following strategy:
- import akka.actor.OneForOneStrategy
- import akka.actor.SupervisorStrategy._
- import scala.concurrent.duration._
-
- override val supervisorStrategy =
- OneForOneStrategy(maxNrOfRetries = 10, withinTimeRange = 1.minute) {
- case _: ArithmeticException => Resume
- case _: NullPointerException => Restart
- case _: IllegalArgumentException => Stop
- case _: Exception => Escalate
- }
I have chosen a few well-known exception types in order to demonstrate the application of the fault handling directives described in Supervision and Monitoring. First off, it is a one-for-one strategy, meaning that each child is treated separately (an all-for-one strategy works very similarly, the only difference is that any decision is applied to all children of the supervisor, not only the failing one). There are limits set on the restart frequency, namely maximum 10 restarts per minute; each of these settings could be left out, which means that the respective limit does not apply, leaving the possibility to specify an absolute upper limit on the restarts or to make the restarts work infinitely. The child actor is stopped if the limit is exceeded.
The match statement which forms the bulk of the body is of type Decider, which is a PartialFunction[Throwable, Directive]. This is the piece which maps child failure types to their corresponding directives.
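Because a Decider is an ordinary partial function, deciders can be defined separately and combined with orElse. The following is a minimal sketch of that idea; the object name Deciders and its members are hypothetical names chosen for illustration, not part of Akka.

```scala
import akka.actor.OneForOneStrategy
import akka.actor.SupervisorStrategy.{ Decider, Escalate, Resume, Stop }
import scala.concurrent.duration._

// Hypothetical helper object, not part of Akka
object Deciders {
  // Each value is a PartialFunction[Throwable, Directive]
  val arithmetic: Decider = { case _: ArithmeticException => Resume }
  val illegalArgument: Decider = { case _: IllegalArgumentException => Stop }
  val fallback: Decider = { case _: Exception => Escalate }

  // The parts are tried from left to right
  val combined: Decider = arithmetic orElse illegalArgument orElse fallback

  // The decider is passed as the second parameter list of the strategy
  val strategy =
    OneForOneStrategy(maxNrOfRetries = 10, withinTimeRange = 1.minute)(combined)
}
```

Note that the combined decider is still not defined for a Throwable that is not an Exception, for example an Error.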
Note
If the strategy is declared inside the supervising actor (as opposed to within a companion object) its decider has access to all internal state of the actor in a thread-safe fashion, including obtaining a reference to the currently failed child (available as the sender of the failure message).
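The variations mentioned earlier in this section (an all-for-one strategy, and leaving out one or both restart limits) might look as follows. This is a sketch only; the object name StrategyVariants and the chosen exception types and limits are assumptions made for illustration.

```scala
import akka.actor.{ AllForOneStrategy, OneForOneStrategy }
import akka.actor.SupervisorStrategy.{ Escalate, Restart }
import scala.concurrent.duration._

// Hypothetical container for the example strategies
object StrategyVariants {
  // All-for-one: the decision is applied to every child of the supervisor
  val allForOne =
    AllForOneStrategy(maxNrOfRetries = 10, withinTimeRange = 1.minute) {
      case _: IllegalStateException => Restart
      case _: Exception             => Escalate
    }

  // Time window left out: an absolute upper limit of 3 restarts
  val absoluteLimit = OneForOneStrategy(maxNrOfRetries = 3) {
    case _: Exception => Restart
  }

  // Both limits left out: restarts are not limited
  val unlimited = OneForOneStrategy() {
    case _: Exception => Restart
  }
}
```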
§Default Supervisor Strategy
Escalate is used if the defined strategy doesn't cover the exception that was thrown.
When the supervisor strategy is not defined for an actor the following exceptions are handled by default:
- ActorInitializationException will stop the failing child actor
- ActorKilledException will stop the failing child actor
- Exception will restart the failing child actor
- Other types of Throwable will be escalated to parent actor
If the exception escalates all the way up to the root guardian, the root guardian handles it in the same way as the default strategy defined above.
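The default mapping can be probed directly through SupervisorStrategy.defaultDecider. The sketch below uses a hypothetical helper object, DefaultDeciderProbe, and falls back to Escalate for anything the decider does not cover, mirroring the rule stated above.

```scala
import akka.actor.SupervisorStrategy
import akka.actor.SupervisorStrategy.{ Directive, Escalate, Restart }

// Hypothetical helper, not part of Akka
object DefaultDeciderProbe {
  // Ask the default decider; failures it does not cover are escalated
  def decide(cause: Throwable): Directive =
    SupervisorStrategy.defaultDecider.applyOrElse(cause, (_: Throwable) => Escalate)
}
```

With this helper, a plain RuntimeException maps to Restart, while an Error is not covered by the default decider and therefore maps to Escalate.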
You can combine your own strategy with the default strategy:
- import akka.actor.OneForOneStrategy
- import akka.actor.SupervisorStrategy._
- import scala.concurrent.duration._
-
- override val supervisorStrategy =
- OneForOneStrategy(maxNrOfRetries = 10, withinTimeRange = 1.minute) {
- case _: ArithmeticException => Resume
- case t =>
- super.supervisorStrategy.decider.applyOrElse(t, (_: Any) => Escalate)
- }
§Stopping Supervisor Strategy
Closer to the Erlang way is the strategy to just stop children when they fail and then take corrective action in the supervisor when DeathWatch signals the loss of the child. This strategy is also provided pre-packaged as SupervisorStrategy.stoppingStrategy with an accompanying StoppingSupervisorStrategy configurator to be used when you want the "/user" guardian to apply it.
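A supervisor following this approach could be sketched as shown below. The class name StoppingParent, its workerProps parameter, and the choice of simply starting a replacement as the corrective action are assumptions made for illustration.

```scala
import akka.actor.{ Actor, ActorRef, Props, SupervisorStrategy, Terminated }

// Hypothetical supervisor; workerProps describes whatever child it manages
class StoppingParent(workerProps: Props) extends Actor {
  // Any child that fails with an Exception is stopped instead of restarted
  override val supervisorStrategy = SupervisorStrategy.stoppingStrategy

  private var worker: ActorRef = startWorker()

  // Create the child and register for its Terminated message (DeathWatch)
  private def startWorker(): ActorRef =
    context.watch(context.actorOf(workerProps))

  def receive = {
    case Terminated(ref) if ref == worker =>
      // Corrective action, here simply starting a replacement
      worker = startWorker()
    case msg =>
      worker forward msg
  }
}
```

To have the "/user" guardian apply the same policy, the setting akka.actor.guardian-supervisor-strategy can be pointed at "akka.actor.StoppingSupervisorStrategy" in the configuration.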
§Logging of Actor Failures
By default the SupervisorStrategy logs failures unless they are escalated. Escalated failures are supposed to be handled, and potentially logged, at a level higher in the hierarchy.
You can mute the default logging of a SupervisorStrategy by setting loggingEnabled to false when instantiating it. Customized logging can be done inside the Decider. Note that the reference to the currently failed child is available as the sender when the SupervisorStrategy is declared inside the supervising actor.
You may also customize the logging in your own SupervisorStrategy implementation by overriding the logFailure method.
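As a sketch of the first two options, the supervisor below mutes the default logging and logs from within its Decider instead. The class name QuietSupervisor and the log message are hypothetical; the strategy limits are copied from the earlier example.

```scala
import akka.actor.{ Actor, ActorLogging, OneForOneStrategy, Props }
import akka.actor.SupervisorStrategy.{ Escalate, Resume }
import scala.concurrent.duration._

// Hypothetical supervisor that replaces the default failure logging
class QuietSupervisor extends Actor with ActorLogging {
  override val supervisorStrategy =
    OneForOneStrategy(
      maxNrOfRetries = 10,
      withinTimeRange = 1.minute,
      loggingEnabled = false) {
      case e: ArithmeticException =>
        // sender() refers to the failed child while the decider runs
        log.warning("Child {} failed with {}, resuming", sender().path, e)
        Resume
      case _: Exception => Escalate
    }

  def receive = {
    case p: Props => sender() ! context.actorOf(p)
  }
}
```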
§Supervision of Top-Level Actors
Top-level actors are those created using system.actorOf(), and they are children of the User Guardian. No special rules apply in this case; the guardian simply applies the configured strategy.
§Test Application
The following section shows the effects of the different directives in practice, for which a test setup is needed. First off, we need a suitable supervisor:
- import akka.actor.Actor
-
- class Supervisor extends Actor {
- import akka.actor.OneForOneStrategy
- import akka.actor.SupervisorStrategy._
- import scala.concurrent.duration._
-
- override val supervisorStrategy =
- OneForOneStrategy(maxNrOfRetries = 10, withinTimeRange = 1.minute) {
- case _: ArithmeticException => Resume
- case _: NullPointerException => Restart
- case _: IllegalArgumentException => Stop
- case _: Exception => Escalate
- }
-
- def receive = {
- case p: Props => sender() ! context.actorOf(p)
- }
- }
This supervisor will be used to create a child, with which we can experiment:
- import akka.actor.Actor
-
- class Child extends Actor {
- var state = 0
- def receive = {
- case ex: Exception => throw ex
- case x: Int => state = x
- case "get" => sender() ! state
- }
- }
Testing is easier with the utilities described in Testing Actor Systems.
- import akka.testkit.{ EventFilter, ImplicitSender, TestActors, TestKit }
- import org.scalatest.{ WordSpecLike, Matchers, BeforeAndAfterAll }
- import akka.actor.{ ActorSystem, ActorRef, Props, Terminated }
- import com.typesafe.config.ConfigFactory
-
- class FaultHandlingDocSpec(_system: ActorSystem) extends TestKit(_system)
- with ImplicitSender with WordSpecLike with Matchers with BeforeAndAfterAll {
-
- def this() = this(ActorSystem("FaultHandlingDocSpec",
- ConfigFactory.parseString("""
- akka {
- loggers = ["akka.testkit.TestEventListener"]
- loglevel = "WARNING"
- }
- """)))
-
- override def afterAll(): Unit = {
- TestKit.shutdownActorSystem(system)
- }
-
- "A supervisor" must {
-
- "apply the chosen strategy for its child" in {
- // code here
- }
- }
- }
Let us create actors:
- val supervisor = system.actorOf(Props[Supervisor], "supervisor")
-
- supervisor ! Props[Child]
- val child = expectMsgType[ActorRef] // retrieve answer from TestKit’s testActor
The first test shall demonstrate the Resume directive, so we try it out by setting some non-initial state in the actor and having it fail:
- child ! 42 // set state to 42
- child ! "get"
- expectMsg(42)
-
- child ! new ArithmeticException // crash it
- child ! "get"
- expectMsg(42)
As you can see the value 42 survives the fault handling directive. Now, if we change the failure to a more serious NullPointerException, that will no longer be the case:
- child ! new NullPointerException // crash it harder
- child ! "get"
- expectMsg(0)
And finally in case of the fatal IllegalArgumentException the child will be terminated by the supervisor:
- watch(child) // have testActor watch “child”
- child ! new IllegalArgumentException // break it
- expectMsgPF() { case Terminated(`child`) => () }
Up to now the supervisor was completely unaffected by the child’s failure, because the directives set did handle it. In case of an Exception, this is not true anymore and the supervisor escalates the failure.
- supervisor ! Props[Child] // create new child
- val child2 = expectMsgType[ActorRef]
-
- watch(child2)
- child2 ! "get" // verify it is alive
- expectMsg(0)
-
- child2 ! new Exception("CRASH") // escalate failure
- expectMsgPF() {
- case t @ Terminated(`child2`) if t.existenceConfirmed => ()
- }
The supervisor itself is supervised by the top-level actor provided by the ActorSystem, which has the default policy to restart in case of all Exception cases (with the notable exceptions of ActorInitializationException and ActorKilledException). Since the default directive in case of a restart is to kill all children, we expected our poor child not to survive this failure.
In case this is not desired (which depends on the use case), we need to use a different supervisor which overrides this behavior.
- class Supervisor2 extends Actor {
- import akka.actor.OneForOneStrategy
- import akka.actor.SupervisorStrategy._
- import scala.concurrent.duration._
-
- override val supervisorStrategy =
- OneForOneStrategy(maxNrOfRetries = 10, withinTimeRange = 1.minute) {
- case _: ArithmeticException => Resume
- case _: NullPointerException => Restart
- case _: IllegalArgumentException => Stop
- case _: Exception => Escalate
- }
-
- def receive = {
- case p: Props => sender() ! context.actorOf(p)
- }
- // override the default behavior, which kills all children during restart
- override def preRestart(cause: Throwable, msg: Option[Any]): Unit = {}
- }
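With this parent, the child survives the escalated restart, as demonstrated in the last test. The child is not stopped, but it is itself restarted along with its parent, which is why its state is reset to 0: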
- val supervisor2 = system.actorOf(Props[Supervisor2], "supervisor2")
-
- supervisor2 ! Props[Child]
- val child3 = expectMsgType[ActorRef]
-
- child3 ! 23
- child3 ! "get"
- expectMsg(23)
-
- child3 ! new Exception("CRASH")
- child3 ! "get"
- expectMsg(0)
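One design consideration: the default preRestart not only stops all children but also calls postStop(), so an empty override skips that cleanup hook as well. If the supervisor relies on postStop() to release resources, a variant such as the following sketch keeps the children while still running the hook. The class name Supervisor3 is hypothetical, and the supervisor strategy is left at its default here for brevity.

```scala
import akka.actor.{ Actor, Props }

// Hypothetical variant of the supervisor that keeps its children on restart
class Supervisor3 extends Actor {
  def receive = {
    case p: Props => sender() ! context.actorOf(p)
  }

  // Do not stop the children, but still run the cleanup hook that the
  // default preRestart would have invoked
  override def preRestart(cause: Throwable, msg: Option[Any]): Unit =
    postStop()
}
```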