Akka: How to schedule retries on failure with growing delay intervals?

https://stackoverflow.com/questions/10364654

04-06-2021

Question

What is a good way of having an actor try something again on failure, but with growing time intervals between the retries? Let's say I want the actor to try again after 15 seconds, then 30 seconds, then every minute for a limited number of times.

Here's what I've come up with:

- The method of the actor that performs the actual work has an optional RetryInfo parameter that, if present, contains the number of the retry we are currently in.
- On failure, the actor sends itself a new ScheduleRetryMessage with retryCount + 1, then throws a RuntimeException.
- Another actor supervises the worker actor, using a OneForOneStrategy constructed with -1 and Duration.Inf() that returns Resume as its Directive. The worker actor has no state, so Resume should be OK.
- On receiving the ScheduleRetryMessage, the actor will:
  - if retryCount < MAX_RETRIES: use Akka's scheduler to schedule sending a RetryMessage after the desired delay
  - else: finally give up, sending a message to another actor for error reporting
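
The retry decision described in the list above can be sketched without any Akka dependency. The helper below is hypothetical: the names RetrySchedule and nextDelay and the value chosen for MAX_RETRIES are assumptions made for illustration; only the 15 second, 30 second, and one minute steps come from the question. Inside the actor, a returned delay would presumably be handed to Akka's scheduler, and an empty result would trigger the error report.

```java
import java.time.Duration;
import java.util.Optional;

// Hypothetical helper, not part of Akka: maps the number of retries
// already made to the delay before the next attempt.
final class RetrySchedule {
    // Assumed limit; the question only says "a limited number of times".
    static final int MAX_RETRIES = 5;

    // retryCount is the number of retries already attempted (0 after the first failure).
    // Returns the delay before the next retry, or empty when the actor should give up.
    static Optional<Duration> nextDelay(int retryCount) {
        if (retryCount < 0) {
            throw new IllegalArgumentException("retryCount must not be negative");
        }
        if (retryCount >= MAX_RETRIES) {
            return Optional.empty();                     // give up, report the error
        }
        if (retryCount == 0) {
            return Optional.of(Duration.ofSeconds(15));  // first retry
        }
        if (retryCount == 1) {
            return Optional.of(Duration.ofSeconds(30));  // second retry
        }
        return Optional.of(Duration.ofMinutes(1));       // every later retry
    }
}
```

Keeping the schedule in a pure function like this makes the delays easy to unit test separately from the actor's messaging.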

Is this a good solution or is there a better approach?

Solution

You could have a supervisor that starts the worker actor. The tip from the docs is to declare a router of size one for the worker. The supervisor would keep track of the number of retries, then schedule the message send to the worker as appropriate.

Even though you would be creating another layer of actors, this seems cleaner to me since you would be keeping the supervisory functionality out of the worker. Ideally you could make this 1 supervisor to n workers, but I think you would have to use Lifecycle Monitoring to get a Failure from a child actor. In that case, you could just keep a map of [ActorRef, Int] to keep track of the number of retries for all supervised workers. The supervision policy would Resume, but if you reached your max retries, you could send a PoisonPill to the offending ActorRef.
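
The per-worker bookkeeping mentioned above might look roughly like the following. This is a minimal sketch under stated assumptions: RetryTracker and its methods are hypothetical names, and the type parameter W stands in for Akka's ActorRef so that the example compiles without Akka. A plain HashMap is used on the assumption that the map is only touched from inside the supervisor actor, which processes one message at a time.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical bookkeeping for a supervisor; W stands in for ActorRef.
final class RetryTracker<W> {
    private final Map<W, Integer> retries = new HashMap<>();
    private final int maxRetries;

    RetryTracker(int maxRetries) {
        this.maxRetries = maxRetries;
    }

    // Records one failure of the given worker.
    // Returns true if the worker may be retried, false once its retries are used up
    // (the point at which the supervisor could stop the worker).
    boolean recordFailure(W worker) {
        int count = retries.merge(worker, 1, Integer::sum);
        if (count > maxRetries) {
            retries.remove(worker);   // forget the worker once it is given up on
            return false;
        }
        return true;
    }

    // Call after a successful attempt so later failures start counting from zero.
    void reset(W worker) {
        retries.remove(worker);
    }

    // Number of failures currently recorded for the worker.
    int retriesSoFar(W worker) {
        return retries.getOrDefault(worker, 0);
    }
}
```

With maxRetries set to 2, the first two failures of a worker return true and the third returns false; other workers are counted independently.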

OTHER TIPS

In such cases I use standard supervision. A parent/supervising actor defines the number of retries allowed within a time window. In preRestart(), the retrying worker child simply re-schedules, with a delay, the message that caused the failure.
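
To illustrate what "retries within a time window" means, here is a simplified model written in plain Java. It is an assumption-laden sketch, not Akka's implementation: RestartWindow and allowRestart are hypothetical names, and the current time is passed in explicitly so the behaviour is easy to follow and test.

```java
import java.time.Duration;

// Simplified, hypothetical model of "at most maxRetries restarts per time window".
final class RestartWindow {
    private final int maxRetries;
    private final long windowNanos;
    private boolean started = false;  // false until the first restart opens a window
    private long windowStart;         // start of the current window, in nanoseconds
    private int count;                // restarts seen in the current window

    RestartWindow(int maxRetries, Duration window) {
        this.maxRetries = maxRetries;
        this.windowNanos = window.toNanos();
    }

    // nowNanos is a monotonic timestamp such as System.nanoTime().
    // Returns true if another restart is allowed at that time.
    boolean allowRestart(long nowNanos) {
        if (!started || nowNanos - windowStart > windowNanos) {
            // No window yet, or the old one has expired: open a new window.
            started = true;
            windowStart = nowNanos;
            count = 1;
            return maxRetries >= 1;
        }
        count++;
        return count <= maxRetries;
    }
}
```

In this model, failures that are spaced far enough apart never exhaust the limit, while a burst of failures inside one window does.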

If the retrying child is rather complex, you might consider inserting an intermediate actor between the supervisor and the worker. That actor simply escalates failures to its own supervisor. In preRestart, the intermediate actor schedules a (delayed) restart message. Because the intermediate actor has preserved its state, it can simply restart the worker actor (with a delay).

As you can see, the delay can be applied either in preRestart or when the worker starts.

Licensed under: CC-BY-SA with attribution
