Question

I have an application distributed over 2 nodes. When I halt() the first node, the failover works perfectly, but (sometimes?) when I restart the first node the takeover fails and the application crashes because start_link returns already_started.

SUPERVISOR REPORT  <0.60.0>                                 2009-05-20 12:12:01
===============================================================================
Reporting supervisor                          {local,twitter_server_supervisor}

Child process
   errorContext                                                     start_error
   reason                                         {already_started,<2415.62.0>}
   pid                                                                undefined
   name                                                                    tag1
   start_function                                {twitter_server,start_link,[]}
   restart_type                                                       permanent
   shutdown                                                               10000
   child_type                                                            worker

ok

My app:

start(_Type, Args)->
    twitter_server_supervisor:start_link( Args ).

stop( _State )->
    ok.

My supervisor:

start_link( Args ) ->
    supervisor:start_link( {local,?MODULE}, ?MODULE, Args ).    

Both nodes are using the same sys.config file.

What am I misunderstanding about this process that would make the above fail?


Solution

It seems your problem stems from twitter_server_supervisor failing to start one of its children: the error report complains about the child with start_function

{twitter_server,start_link,[]}

Since you are not showing that code, I can only guess that it is trying to register a name for itself, but a process is already registered under that name.
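To make that guess concrete, here is a minimal sketch of a start_link that would produce this kind of error. It is purely hypothetical: the module name twitter_server_sketch and the use of {global, ?MODULE} are assumptions, not the asker's actual code.

```erlang
-module(twitter_server_sketch).
-behaviour(gen_server).

-export([start_link/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

%% Hypothetical start function. {global, Name} registers the process
%% cluster-wide, so a second start on any connected node fails with
%% {error, {already_started, Pid}}, where Pid is the existing process.
start_link() ->
    gen_server:start_link({global, ?MODULE}, ?MODULE, [], []).

%% Minimal callbacks so the module compiles and runs.
init([]) -> {ok, []}.
handle_call(_Request, _From, State) -> {reply, ok, State}.
handle_cast(_Msg, State) -> {noreply, State}.
handle_info(_Info, State) -> {noreply, State}.
```

Calling start_link/0 a second time while the first process is still alive returns the already_started tuple, which a supervisor then reports as a start_error for that child.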

Guessing further: the reason includes a pid, which belongs to the process that already holds the name we tried to register:

{already_started,<2415.62.0>}

The first integer of that pid is non-zero; if it were zero, the pid would refer to a local process. From this I deduce that you are trying to register a global name, and that you are connected to another node where a process is already globally registered under that name.
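One way to check this deduction from the shell of the restarted node is to ask the global registry who holds the name. The helper below is only a sketch: the module name name_check and the function where_registered/1 are made up for illustration, and it assumes the server registers itself through global.

```erlang
-module(name_check).
-export([where_registered/1]).

%% Report whether a globally registered name is held by a process on
%% this node or on another connected node.
where_registered(Name) ->
    case global:whereis_name(Name) of
        undefined ->
            not_registered;
        Pid when node(Pid) =:= node() ->
            %% The pid lives on the node we are running on.
            {local, node(Pid)};
        Pid ->
            %% The pid lives on another node in the cluster.
            {remote, node(Pid)}
    end.
```

If calling it with whatever name the server registers returns {remote, Node}, the old instance on the other node is still holding the name when the restarted node tries to start its own copy.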

Licensed under: CC-BY-SA with attribution