Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Horde OTP 27 issue #275

Open
nash8114 opened this issue Sep 3, 2024 · 5 comments
Open

Horde OTP 27 issue #275

nash8114 opened this issue Sep 3, 2024 · 5 comments

Comments

@nash8114
Copy link

nash8114 commented Sep 3, 2024

When upgrading our Horde-dependent application to OTP 27 we experienced an issue that requests were not being picked up. This issue should affect all Horde applications which use the default strategy Horde.UniformDistribution

Horde depends on libring to implement Horde.UniformDistribution. Due to changes in OTP 27 libring no longer has a uniform distribution but instead falls back to a "List.first" (actually :gb_trees.smallest(r)). See also bitwalker/libring#35
Because of an implementation choice in libring, the change in OTP causes the algorithm to silently change its behavior. The outcome will always be the same node, regardless of the input.

When all nodes of a DynamicSupervisor agree on the topology, they should choose the same node for a given child_spec. The request is proxied to that node, and its choose_node outcome should pick itself, so it will start the child.
When the algorithms on two nodes disagree, this can cause infinite message proxieing, where node-A selects node-B, and node-B selects node-A. They will proxy to each other infinitely (behavior for which I want to create a PR later).

Nodes with OTP 27 will have different choose_node outcome than OTP 26 or earlier, until libring is patched. Upgrading to OTP 27 is dangerous and can cause messages to bounce around the system, when nodes of OTP 27 and OTP <=26 are alive together. If you do manage to upgrade to OTP 27 on all nodes, then be aware that only a single node will get all start_child requests.

@derekkraan
Copy link
Owner

This is bizarre. Are you saying that libring simply selects the first node every time on OTP 27?

@nash8114
Copy link
Author

nash8114 commented Sep 3, 2024

That is unfortunately exactly what I'm saying. This is an implementation choice due to a fallback in a case statement in key_to_node. https://github.com/bitwalker/libring/blob/main/lib/ring.ex#L212

Works fine on OTP 26

iex(1)> System.build_info
%{
  version: "1.15.7",
  date: "2023-11-22T00:59:42Z",
  otp_release: "26",
  build: "1.15.7 (compiled with Erlang/OTP 26)",
  revision: ""
}
iex(2)> Mix.install([:libring])
==> libring
Compiling 5 files (.ex)

Generated libring app
:ok
iex(3)> ring = HashRing.new() |> HashRing.add_nodes(~w[a b c d]); Enum.map(1..1000, fn n -> HashRing.key_to_node(ring, n) end) |> Enum.frequencies
%{"a" => 262, "b" => 240, "c" => 246, "d" => 252}

The above gives an approximate fair distribution.

Broken on OTP 27

iex(1)> System.build_info
%{
  version: "1.17.2",
  date: "2024-08-13T20:58:12Z",
  otp_release: "27",
  build: "1.17.2 (compiled with Erlang/OTP 27)",
  revision: ""
}
iex(2)> Mix.install([:libring])
==> libring
Compiling 5 files (.ex)

Generated libring app
:ok
iex(3)> ring = HashRing.new() |> HashRing.add_nodes(~w[a b c d]); Enum.map(1..1000, fn n -> HashRing.key_to_node(ring, n) end) |> Enum.frequencies
%{"b" => 1000}

This time around the result is only "b"

@derekkraan
Copy link
Owner

Very unfortunate. Have you already raised the issue with the libring project?

@nash8114
Copy link
Author

nash8114 commented Sep 3, 2024 via email

@derekkraan
Copy link
Owner

Doesn't seem necessary. I'll subscribe to notifications on this PR and issue an update when it is resolved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants