Question

I'm using ClusterSingletonManager from the Akka 2.2 contrib project to guarantee that there is always exactly one actor of a specific type (the master) in a cluster. However, I've observed some odd behaviour (which may well be expected, but I can't understand why). Whenever a master drops out of the cluster and rejoins later, the following sequence of actions occurs:

[INFO] [04/30/2013 17:47:35.805] [ClusterSystem-akka.actor.default-dispatcher-9] [akka://ClusterSystem/system/cluster/core/daemon] Cluster Node [akka.tcp://ClusterSystem@127.0.0.1:2551] - Welcome from [akka.tcp://ClusterSystem@127.0.0.1:2552]
[INFO] [04/30/2013 17:47:48.703] [ClusterSystem-akka.actor.default-dispatcher-8] [akka://ClusterSystem/user/singleton] Member removed [akka.tcp://ClusterSystem@127.0.0.1:52435]
[INFO] [04/30/2013 17:47:48.712] [ClusterSystem-akka.actor.default-dispatcher-2] [akka://ClusterSystem/user/singleton] ClusterSingletonManager state change [Start -> BecomingLeader]
[INFO] [04/30/2013 17:47:49.752] [ClusterSystem-akka.actor.default-dispatcher-9] [akka://ClusterSystem/user/singleton] Retry [1], sending HandOverToMe to [None]
[INFO] [04/30/2013 17:47:50.850] [ClusterSystem-akka.actor.default-dispatcher-21] [akka://ClusterSystem/user/singleton] Retry [2], sending HandOverToMe to [None]
[INFO] [04/30/2013 17:47:51.951] [ClusterSystem-akka.actor.default-dispatcher-20] [akka://ClusterSystem/user/singleton] Retry [3], sending HandOverToMe to [None]
[INFO] [04/30/2013 17:47:53.049] [ClusterSystem-akka.actor.default-dispatcher-3] 

...

[INFO] [04/30/2013 17:48:10.650] [ClusterSystem-akka.actor.default-dispatcher-21] [akka://ClusterSystem/user/singleton] Retry [20], sending HandOverToMe to [None]
[INFO] [04/30/2013 17:48:11.751] [ClusterSystem-akka.actor.default-dispatcher-4] [akka://ClusterSystem/user/singleton] Timeout in BecomingLeader. Previous leader unknown, removed and no TakeOver request.
[INFO] [04/30/2013 17:48:11.752] [ClusterSystem-akka.actor.default-dispatcher-4] [akka://ClusterSystem/user/singleton] Singleton manager [akka.tcp://ClusterSystem@127.0.0.1:2551] starting singleton actor
[INFO] [04/30/2013 17:48:11.754] [ClusterSystem-akka.actor.default-dispatcher-4] [akka://ClusterSystem/user/singleton] ClusterSingletonManager state change [BecomingLeader -> Leader]

Why is it attempting to send a HandOverToMe to [None]? It takes about 20 seconds (20 retries) until it becomes the new leader, even though in this particular situation the previous one was well known...

Solution

I'm not sure this fully answers your question, but the source code for ClusterSingletonManager shows the chain of events that leads to this scenario. The class is built on Akka's finite state machine (FSM) support, and the behavior you are seeing is kicked off by the state transition Start -> BecomingLeader. First, look at the Start state:

when(Start) {
  case Event(StartLeaderChangedBuffer, _) ⇒
    leaderChangedBuffer = context.actorOf(Props[LeaderChangedBuffer].withDispatcher(context.props.dispatcher))
    getNextLeaderChanged()
    stay

  case Event(InitialLeaderState(leaderOption, memberCount), _) ⇒
    leaderChangedReceived = true
    if (leaderOption == selfAddressOption && memberCount == 1)
      // alone, leader immediately
      gotoLeader(None)
    else if (leaderOption == selfAddressOption)
      goto(BecomingLeader) using BecomingLeaderData(None)
    else
      goto(NonLeader) using NonLeaderData(leaderOption)
}

The part to look at here is:

    else if (leaderOption == selfAddressOption)
      goto(BecomingLeader) using BecomingLeaderData(None)

To me, this branch says: "If I'm the leader (and not the only member), transition from Start to BecomingLeader with None as the previous leader."
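
As a minimal sketch of that decision, the three branches can be modelled as a pure function. This is a simplified model rather than the Akka source: addresses are plain strings, and the names Next, StartDecision and decide are hypothetical, introduced only for illustration.

```scala
// Simplified, Akka-free model of the Start-state decision (hypothetical names).
object StartDecision {
  sealed trait Next
  case object Leader extends Next
  final case class BecomingLeader(previousLeader: Option[String]) extends Next
  final case class NonLeader(leader: Option[String]) extends Next

  // Mirrors the three branches of the InitialLeaderState case shown above.
  def decide(leader: Option[String], self: Option[String], memberCount: Int): Next =
    if (leader == self && memberCount == 1) Leader   // alone, leader immediately
    else if (leader == self) BecomingLeader(None)    // leader, but previous leader unknown
    else NonLeader(leader)

  def main(args: Array[String]): Unit = {
    val self = Some("node-2551")
    // Self is leader in a two-member cluster: previous leader is recorded as None.
    println(decide(leader = self, self = self, memberCount = 2))
  }
}
```

Whenever the node sees itself as leader and is not alone, the previous leader is recorded as None, regardless of what the cluster looked like before.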

Then, if you look at the BecomingLeader state:

when(BecomingLeader) {
  ...
  case Event(HandOverRetry(count), BecomingLeaderData(previousLeaderOption)) ⇒
    if (count <= maxHandOverRetries) {
      logInfo("Retry [{}], sending HandOverToMe to [{}]", count, previousLeaderOption)
      previousLeaderOption foreach { peer(_) ! HandOverToMe }
      setTimer(HandOverRetryTimer, HandOverRetry(count + 1), retryInterval, repeat = false)
    } else if (previousLeaderOption forall removed.contains) {
      // can't send HandOverToMe, previousLeader unknown for new node (or restart)
      // previous leader might be down or removed, so no TakeOverFromMe message is received
      logInfo("Timeout in BecomingLeader. Previous leader unknown, removed and no TakeOver request.")
      gotoLeader(None)
    } else
      throw new ClusterSingletonManagerIsStuck(
        s"Becoming singleton leader was stuck because previous leader [${previousLeaderOption}] is unresponsive")
  }

This is the block that keeps logging the message you are seeing. It attempts to get the previous leader to hand over responsibility without knowing who that leader was, because the state transition passed in None as the previous leader. With None, the foreach sends nothing, so each retry only logs and reschedules the timer until count exceeds maxHandOverRetries; at that point the forall check is trivially true for None, and the manager finally becomes leader. The million-dollar question is: if it doesn't know who the previous leader is, why keep attempting handoffs that can never succeed?
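
The effect of passing None through that retry branch can be reproduced without Akka. The following is only a sketch under stated assumptions: the timer is replaced by a plain loop, addresses are strings, the value 20 for maxHandOverRetries is inferred from the log in the question, and the names HandOverRetrySketch, Outcome and run are hypothetical. It also ignores the other BecomingLeader cases, which could end the wait early if a hand-over message arrived.

```scala
// Akka-free model of the HandOverRetry branch (hypothetical names, no timers).
object HandOverRetrySketch {
  sealed trait Outcome
  final case class BecameLeader(retriesLogged: Int, messagesSent: Int) extends Outcome
  final case class Stuck(previousLeader: String) extends Outcome

  def run(previousLeader: Option[String], removed: Set[String], maxHandOverRetries: Int): Outcome = {
    var sent = 0
    var count = 1
    while (count <= maxHandOverRetries) {
      // Stands in for: previousLeaderOption foreach { peer(_) ! HandOverToMe }
      previousLeader.foreach(_ => sent += 1)
      count += 1
    }
    // Option.forall is true for None, so an unknown previous leader ends up here.
    if (previousLeader.forall(removed.contains)) BecameLeader(maxHandOverRetries, sent)
    else Stuck(previousLeader.get) // safe: forall is false only for Some
  }

  def main(args: Array[String]): Unit = {
    // 20 retries are logged, yet no HandOverToMe message is ever sent.
    println(run(None, Set.empty, 20))
  }
}
```

In this model, a None previous leader always waits out every retry before becoming leader, which matches the roughly 20-second delay in the log.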

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow