Static cluster membership set, but when a new node outside of the list joins, its Registry and DynamicSupervisor still joins the cluster? #210

Open
x-ji opened this issue Sep 17, 2020 · 5 comments
Labels
bug Something isn't working

@x-ji

x-ji commented Sep 17, 2020

We are now trying to use static cluster membership, since the dynamic cluster membership seems to be causing issues when one k8s pod becomes temporarily invisible, probably due to some automatic k8s maintenance operations (we're using libcluster's Kubernetes.DNSSRV strategy).

The members argument is specified as a list:

      [
        {App.Module.Registry,
         :"app@app-service-0.app-service-headless.#{namespace}.svc.cluster.local"},
        {App.Module.Registry,
         :"app@app-service-1.app-service-headless.#{namespace}.svc.cluster.local"},
        {App.Module.Registry,
         :"app@app-service-2.app-service-headless.#{namespace}.svc.cluster.local"}
      ]

The setup is similar for the DynamicSupervisor module.
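
For illustration, a list of this shape could be generated instead of written out by hand. This is only a minimal sketch, assuming the node-name pattern shown above; the `StaticMembers` module, its `build/3` function, and the replica-count argument are hypothetical names and not part of Horde.

```elixir
defmodule StaticMembers do
  @moduledoc false

  # Builds a static members list for the Horde process `name`:
  # one {name, node} entry per StatefulSet ordinal 0..replicas-1.
  def build(name, namespace, replicas) when is_integer(replicas) and replicas > 0 do
    for ordinal <- 0..(replicas - 1) do
      {name, node_name(namespace, ordinal)}
    end
  end

  # Node-name pattern copied from the list above (assumed to hold for every pod).
  defp node_name(namespace, ordinal) do
    :"app@app-service-#{ordinal}.app-service-headless.#{namespace}.svc.cluster.local"
  end
end
```

The result of `StaticMembers.build(App.Module.Registry, namespace, 3)` would then be passed as the members option when starting the Registry, and an equivalent list with the supervisor's name when starting the DynamicSupervisor.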

We have a StatefulSet deployment with 3 replicas. When I scale down the pods with kubectl scale statefulset service --replicas 2, things seem to work as expected: if I call Horde.Cluster.members(App.Module.Registry), I still see the original list.

However, if I scale up with kubectl scale statefulset service --replicas 4, the Registry started on the new node still seems to join the cluster for some reason. If I run Horde.Cluster.members(App.Module.Registry), I see the extra entry {App.Module.Registry, :"app@app-service-3.app-service-headless.#{namespace}.svc.cluster.local"}.

Interestingly, even if I scale back to 3 again, that extra Registry remains in the members list, while the extra DynamicSupervisor is gone.

Is this the expected behavior? From the documentation, I thought that Horde should only try to find the members listed in the static list, and not try to add new members to that list. I would expect the Registry and DynamicSupervisor on the fourth node to be ignored.

We're using the Horde.UniformQuorumDistribution strategy for the Supervisor, though I feel that should be irrelevant to the membership issue.

@derekkraan
Owner

Ah hmm, I wonder if it works to start a Horde.Registry or Horde.DynamicSupervisor and tell it that it will not be part of the cluster.

@x-ji
Author

x-ji commented Sep 17, 2020

Right, so this is what happened in this case, and apparently the new Registry/DynamicSupervisor will still try to join the cluster regardless of the static list, which doesn't actually include it.

I guess this is not the intended usage of the static cluster membership. We were trying to use dynamic cluster membership but it didn't work out. Scaling to 4 replicas was also more of a hypothetical test which shouldn't happen in a real k8s cluster with a fixed number of replicas.

Still, I wonder if it would be possible to do something in this case and prevent the new Registry/DynamicSupervisor from joining, or maybe just shut it down if it has a members option which doesn't include itself? Not sure how complicated it would be to implement. Or alternatively, whether it would make sense to mention this scenario in the documentation.

@x-ji
Author

x-ji commented Sep 17, 2020

By the way, when I tried to scale down from 4 to 3 again, an (EXIT) no process error similar to #202 happened on all 3 of the remaining nodes.

** (stop) exited in: GenServer.stop(Assistant.Inbox.Sync.Supervisor.ProcessesSupervisor, :normal, :infinity)
    ** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started
    (elixir 1.10.2) lib/gen_server.ex:971: GenServer.stop/3
    (horde 0.8.1) lib/horde/dynamic_supervisor_impl.ex:605: Horde.DynamicSupervisorImpl.shut_down_all_processes/1
    (horde 0.8.1) lib/horde/dynamic_supervisor_impl.ex:374: Horde.DynamicSupervisorImpl.handle_info/2
    (stdlib 3.11.2) gen_server.erl:637: :gen_server.try_dispatch/4
    (stdlib 3.11.2) gen_server.erl:711: :gen_server.handle_msg/6
    (stdlib 3.11.2) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
Last message: {:crdt_update, [{:add, {:member_node_info, {Assistant.Inbox.Sync.Supervisor, :"assistant@assistant-service-0.assistant-service-headless.review-ka-1621-te-gng7km.svc.cluster.local"}}, %Horde.DynamicSupervisor.Member{name: {Assistant.Inbox.Sync.Supervisor, :"assistant@assistant-service-0.assistant-service-headless.review-ka-1621-te-gng7km.svc.cluster.local"}, status: :shutting_down}}]}
11:46:18.377 [info] Starting Horde.DynamicSupervisorImpl with name Assistant.Inbox.Sync.Supervisor
11:46:18.371 [error] Supervisor 'Elixir.Assistant.Inbox.Sync.Supervisor.Supervisor' had child 'Elixir.Assistant.Inbox.Sync.Supervisor.ProcessesSupervisor' started with 'Elixir.Horde.ProcessesSupervisor':start_link([{shutdown,infinity},{root_name,'Elixir.Assistant.Inbox.Sync.Supervisor'},{type,supervisor},{name,...},...]) at <0.9760.0> exit with reason normal in context child_terminated
11:46:18.374 [error] gen_server 'Elixir.Assistant.Inbox.Sync.Supervisor' terminated with reason: no such process or port in call to 'Elixir.GenServer':stop('Elixir.Assistant.Inbox.Sync.Supervisor.ProcessesSupervisor', normal, infinity) in 'Elixir.GenServer':stop/3 line 971
11:46:18.376 [error] CRASH REPORT Process 'Elixir.Assistant.Inbox.Sync.Supervisor' with 0 neighbours exited with reason: no such process or port in call to 'Elixir.GenServer':stop('Elixir.Assistant.Inbox.Sync.Supervisor.ProcessesSupervisor', normal, infinity) in 'Elixir.GenServer':stop/3 line 971
11:46:18.377 [error] Supervisor 'Elixir.Assistant.Inbox.Sync.Supervisor.Supervisor' had child 'Elixir.Horde.DynamicSupervisorImpl' started with 'Elixir.Horde.DynamicSupervisorImpl':start_link([{name,'Elixir.Assistant.Inbox.Sync.Supervisor'},{root_name,'Elixir.Assistant.Inbox.Sync.Supervisor'},...]) at <0.9758.0> exit with reason no such process or port in call to 'Elixir.GenServer':stop('Elixir.Assistant.Inbox.Sync.Supervisor.ProcessesSupervisor', normal, infinity) in context shutdown_error
11:49:02.666 [warn] [libcluster:assistant_service] unable to connect to :"assistant@assistant-service-2.assistant-service-headless.review-ka-1621-te-gng7km.svc.cluster.local"
11:49:02.673 [warn] [libcluster:assistant_service] unable to connect to :"assistant@assistant-service-2.assistant-service-headless.review-ka-1621-te-gng7km.svc.cluster.local"
11:49:12.673 [warn] [libcluster:assistant_service] unable to connect to :"assistant@assistant-service-2.assistant-service-headless.review-ka-1621-te-gng7km.svc.cluster.local"
11:49:12.679 [warn] [libcluster:assistant_service] unable to connect to :"assistant@assistant-service-2.assistant-service-headless.review-ka-1621-te-gng7km.svc.cluster.local"

So I guess this scenario is probably something unexpected for Horde.

@derekkraan
Owner

I think you're right that this should at the very least be included in the documentation.

I suppose it would be possible to check whether an instance of Horde.Registry is in its own list of members (I guess by ensuring that at least one of the members resolves to self() via Process.whereis/1 or equivalent). What would be the correct behaviour if the condition is not met? Raising an error?
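
A rough sketch of what such a check could look like, written against plain data rather than Horde internals. Everything here is hypothetical: `MembershipCheck`, `includes_self?/3` and `ensure_self_member!/3` are not Horde functions, and treating a bare name as "this name on the local node" is an assumption about how member entries are interpreted. Raising is shown only as one of the possible answers to the question above.

```elixir
defmodule MembershipCheck do
  @moduledoc false

  # True when `members` is :auto, or when it contains an entry that
  # refers to `name` on `current_node`.
  def includes_self?(name, members, current_node \\ node())

  def includes_self?(_name, :auto, _current_node), do: true

  def includes_self?(name, members, current_node) when is_list(members) do
    Enum.any?(members, fn
      {^name, ^current_node} -> true
      # Assumption: a bare name means "this name on the local node".
      ^name -> true
      _other -> false
    end)
  end

  # One possible reaction: refuse to start when the static list excludes us.
  def ensure_self_member!(name, members, current_node \\ node()) do
    if includes_self?(name, members, current_node) do
      :ok
    else
      raise ArgumentError,
            "#{inspect(name)} on #{inspect(current_node)} is not in its own members list"
    end
  end
end
```

With a check like this, the fourth node in the scenario above would fail at startup instead of silently joining, since its static list only names nodes 0 to 2.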

@derekkraan
Owner

> By the way, when I tried to scale down from 4 to 3 again, an (EXIT) no process error similar to #202 happened on all 3 of the remaining nodes.
> [...]
> So I guess this scenario is probably something unexpected for Horde.

I believe this is a different scenario from #202. At least, the stacktrace does not match. I have looked into this before, but couldn't find anything obvious. I hope this isn't happening to people on a regular basis; it should be possible to reduce the size of your horde cluster without the whole thing falling apart.
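
To make the error class in the stacktrace concrete: GenServer.stop/3 exits the caller with a :noproc reason when the target is no longer alive or registered. The sketch below only illustrates that behaviour and one generic way a caller can tolerate it; `SafeStop` is a hypothetical helper, not Horde code and not a proposed patch, and it says nothing about why the ProcessesSupervisor was already gone.

```elixir
defmodule SafeStop do
  @moduledoc false

  # Stops `server`, treating "process already gone" as success
  # instead of letting the :noproc exit crash the caller.
  def stop(server, reason \\ :normal, timeout \\ :infinity) do
    GenServer.stop(server, reason, timeout)
  catch
    :exit, {:noproc, _details} -> :ok
  end
end
```

Any other exit reason, such as a timeout, would still propagate to the caller.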

@derekkraan derekkraan added the bug Something isn't working label Oct 29, 2020