GenServer get restarted twice after node restarts #224

mszmurlo · 2021-03-07T10:48:12Z

I recently ran into the following issue. A GenServer (called CA in the rest of the post) gets restarted two times by the
Horde.DynamicSupervisor when nodes had been removed and re-included few times in the cluster.
The source code demonstrating the issue is available on Gitlab.

I'm using Elixir 1.10.1 and Horde 0.8.3. When I downgraded horde to 0.7.1 I couldn't reproduce the issue.

Demonstration

To demonstrate the issue:

Start the application
Create a GenServer (module POC.CA in my code) with the HTTP interface, i.e. that is execute the POCWeb.Login.exec() controller
Kill (with two times Control-c) the node on which the CA is running, restart it, re-do the operation and observe the logs

The two first steps are as follows:

Start node 1 in one terminal : $ HTTP_PORT=5001 ERL_AFLAGS="-name [email protected] -setcookie abc" iex -S mix phx.server
Start node 2 in another terminal : $ HTTP_PORT=5002 ERL_AFLAGS="-name [email protected] -setcookie abc" iex -S mix phx.server
Send the following request $ http -v http://localhost:5001/api cmd=login from a third terminal.

In the process above, let's assume that while the HTTP requests hit the node N1, the CA is created on N2. This gives us the following traces:

Terminal 1

    ... 
(1) Interactive Elixir (1.10.1) - press Ctrl+C to exit (type h() ENTER for help)
(2) iex([email protected])1> [debug] NodeListener.handle_info(:nodeup)
    [info] [libcluster:example] connected to :"[email protected]"
    [debug] Cluster members: [{POC.DReg, :"[email protected]"}, {POC.DReg, :"[email protected]"}]
    [debug] Cluster members: [{POC.DSup, :"[email protected]"}, {POC.DSup, :"[email protected]"}]
(3) [info] POST /api
    [debug] Processing with POCWeb.Login.exec/2
      Parameters: %{"cmd" => "login"}
      Pipelines: [:api]
(4) [debug] Login.exec()
(5) [debug] Login.exec(): start_child->res={:ok, #PID<21019.478.0>}
    [info] Sent 200 in 99ms

The BEAM is booting
Detection of the second node joining the cluster
Reception of the HTTP request
Start of the login controller. The controller calls Horde.DynamicSupervisor.start_child(POC.DSup, {POC.CA, {uaid, caid}}) which will call CA.start_link()
End of the login controller

Terminal 2

    ...
(1) Interactive Elixir (1.10.1) - press Ctrl+C to exit (type h() ENTER for help)
(2) iex([email protected])1> [debug] NodeListener.handle_info(:nodeup)
    [debug] Cluster members: [{POC.DReg, :"[email protected]"}, {POC.DReg, :"[email protected]"}]
    [debug] Cluster members: [{POC.DSup, :"[email protected]"}, {POC.DSup, :"[email protected]"}]
(3) [debug] CA.start_link({uaid="FFLFNTRJWV", caid="OLGFXEZLCO")
(4) [debug] CA.init(uaid="FFLFNTRJWV", caid="OLGFXEZLCO")
(5) [debug] CA.start_link(): GS.start_link->res={:ok, #PID<0.478.0>}
    [debug] CA.start_link(): process started

The BEAM is booting
Detection of the second node joining the cluster
CA.start_link called by the login controller. Calls GenServer.start_link(__MODULE__, {uaid, caid}, name: via_tuple(caid))
CA.init() called by GenServer.start_link() at step 3.
Back in CA.start_link(). The process was created on N2

Killing and restarting the nodes

From now on, we will

Kill the node on which the CA is running
Restart that node
Observe the logs
Go to step 1.

Kill N2 (as the `CA` is running on N2)

Terminal 1

(1) [debug] NodeListener.handle_info(:nodedown)
    [debug] Cluster members: [{POC.DReg, :"[email protected]"}]
    [debug] Cluster members: [{POC.DSup, :"[email protected]"}]
(2) [debug] CA.start_link({uaid="FFLFNTRJWV", caid="OLGFXEZLCO")
    [debug] CA.init(uaid="FFLFNTRJWV", caid="OLGFXEZLCO")
    [debug] CA.start_link(): GS.start_link->res={:ok, #PID<0.494.0>}
    [debug] CA.start_link(): process started

Detection of N2 failure. Cluster reorganization
Restart of the CA.

Restart of node 2

Terminal 2

(1) $ HTTP_PORT=5002 ERL_AFLAGS="-name [email protected] -setcookie abc" iex -S mix phx.server
(2) Erlang/OTP 22 [erts-10.6.4] [source] [64-bit] [smp:12:12] [ds:12:12:10] [async-threads:1]
    ...
    Interactive Elixir (1.10.1) - press Ctrl+C to exit (type h() ENTER for help)
(3)    [debug] NodeListener.handle_info(:nodeup)
    iex([email protected])1> [debug] Cluster members: [{POC.DReg, :"[email protected]"}, {POC.DReg, :"[email protected]"}]
    [debug] Cluster members: [{POC.DSup, :"[email protected]"}, {POC.DSup, :"[email protected]"}]

Starting node N2
BEAM boots and the application starts
Cluster reorganization. Same actually happens on node N1

Kill N1 (as the `CA` had been restarted on N1)

Terminal 2

(1) [debug] CA.start_link({uaid="FFLFNTRJWV", caid="OLGFXEZLCO")
    [debug] CA.init(uaid="FFLFNTRJWV", caid="OLGFXEZLCO")
    [debug] CA.start_link(): GS.start_link->res={:ok, #PID<0.484.0>}
    [debug] CA.start_link(): process started

Process for the CA is restarted on node N2

Restart of node 1

The cluster gets reorganized as before

Kill N2

This where the strange thing will happen. Look at step 3.

Terminal 1

(1) [debug] NodeListener.handle_info(:nodedown)
    [debug] Cluster members: [{POC.DReg, :"[email protected]"}]
    [debug] Cluster members: [{POC.DSup, :"[email protected]"}]
(2) [debug] CA.start_link({uaid="FFLFNTRJWV", caid="OLGFXEZLCO")
    [debug] CA.init(uaid="FFLFNTRJWV", caid="OLGFXEZLCO")
    [debug] CA.start_link(): GS.start_link->res={:ok, #PID<0.482.0>}
    [debug] CA.start_link(): process started
(3) [debug] CA.start_link({uaid="FFLFNTRJWV", caid="OLGFXEZLCO")
    [debug] CA.start_link(): GS.start_link->res={:error, {:already_started, #PID<0.482.0>}}
    [debug] CA.start_link(): error: {:error, {:already_started, #PID<0.482.0>}}. process not started

Detection of the failure of node N2. Cluster re-organization
The CA gets restarted. This is normal. The following 3 lines show that the process is restarts successfully
CA.start_link() get called a second time. As the process had already been started by step 2. the error :already_started is returned

The text was updated successfully, but these errors were encountered:

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GenServer get restarted twice after node restarts #224

GenServer get restarted twice after node restarts #224

mszmurlo commented Mar 7, 2021 •

edited

Loading

GenServer get restarted twice after node restarts #224

GenServer get restarted twice after node restarts #224

Comments

mszmurlo commented Mar 7, 2021 • edited Loading

Demonstration

Terminal 1

Terminal 2

Killing and restarting the nodes

Kill N2 (as the CA is running on N2)

Terminal 1

Restart of node 2

Terminal 2

Kill N1 (as the CA had been restarted on N1)

Terminal 2

Restart of node 1

Kill N2

Terminal 1

mszmurlo commented Mar 7, 2021 •

edited

Loading

Kill N2 (as the `CA` is running on N2)

Kill N1 (as the `CA` had been restarted on N1)