fix: grpc conn refresh #690

Merged
merged 3 commits into from
Nov 23, 2021

Contributor

@fengjiachun fengjiachun commented Oct 9, 2021

Motivation:

When the gRPC connection fails too many times (default: 3), reset the connect backoff and reconnect immediately.

Result:

Fixes #683


@sofastack-bot sofastack-bot bot added bug Something isn't working cla:yes size/L labels Oct 9, 2021
@fengjiachun fengjiachun force-pushed the fix_grpc_conn_refresh branch 3 times, most recently from 1d16a70 to 84aff5d Compare October 11, 2021 01:36
@fengjiachun fengjiachun changed the title fix: grpc conn refresh [WIP]fix: grpc conn refresh Oct 13, 2021
@fengjiachun fengjiachun changed the title [WIP]fix: grpc conn refresh fix: grpc conn refresh Oct 13, 2021
@fengjiachun
Contributor Author

TODO: add some tests

@fengjiachun fengjiachun changed the title fix: grpc conn refresh [WIP]fix: grpc conn refresh Oct 25, 2021
@fengjiachun
Contributor Author

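The following changes were made:

  1. I had been misunderstanding gRPC's notifyWhenStateChanged API all along and was using it backwards (although this did not affect the main flow). I will paste the detailed comment for this API below.
  2. A channel's state is now checked whenever a connection is obtained. By default, on the second TRANSIENT_FAILURE we try calling resetConnectBackoff to refresh DNS; on the third TRANSIENT_FAILURE the channel is removed outright, so the connection is re-established next time.
  3. A unit test is hard to write, so I wrote a manual test program: start an rpcServer (127.0.0.1), then start a client whose target address is my.test.host (initially set to 127.0.0.2), and keep sending ping requests to the server. As expected, because the address 127.0.0.2 does not exist, the requests keep failing at first. Then change my.test.host to 127.0.0.1, and after a while the client can successfully send pings to the server.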
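The escalation described in item 2 can be sketched as a small counter. This is only an illustration and not the code in this PR: `ChannelFailurePolicy`, its enums, and `onStateObserved` are hypothetical names, the `State` enum merely mirrors gRPC's connectivity states so the example stays dependency-free, and resetting the counter on READY is an assumption.

```java
/** Hypothetical sketch of the "reset backoff, then remove channel" escalation. */
final class ChannelFailurePolicy {
    /** Mirrors gRPC's connectivity states; redeclared here to avoid a gRPC dependency. */
    enum State { IDLE, CONNECTING, READY, TRANSIENT_FAILURE, SHUTDOWN }

    /** What the caller should do with the channel after a state check. */
    enum Action { NONE, RESET_CONNECT_BACKOFF, REMOVE_CHANNEL }

    private final int resetThreshold;   // e.g. 2: ask the channel to reset its connect backoff
    private final int removeThreshold;  // e.g. 3: drop the channel and rebuild it next time
    private int failures;

    ChannelFailurePolicy(int resetThreshold, int removeThreshold) {
        if (resetThreshold < 1 || removeThreshold <= resetThreshold) {
            throw new IllegalArgumentException("expected 1 <= resetThreshold < removeThreshold");
        }
        this.resetThreshold = resetThreshold;
        this.removeThreshold = removeThreshold;
    }

    /** Call with the channel's current state each time a connection is obtained. */
    synchronized Action onStateObserved(State state) {
        if (state == State.READY) {
            failures = 0; // assumption: a healthy channel clears the failure count
            return Action.NONE;
        }
        if (state != State.TRANSIENT_FAILURE) {
            return Action.NONE; // IDLE, CONNECTING and SHUTDOWN are not counted here
        }
        failures++;
        if (failures >= removeThreshold) {
            failures = 0; // the replacement channel starts from a clean count
            return Action.REMOVE_CHANNEL;
        }
        return failures >= resetThreshold ? Action.RESET_CONNECT_BACKOFF : Action.NONE;
    }

    public static void main(String[] args) {
        ChannelFailurePolicy policy = new ChannelFailurePolicy(2, 3);
        for (int i = 1; i <= 3; i++) {
            System.out.println(i + ": " + policy.onStateObserved(State.TRANSIENT_FAILURE));
        }
    }
}
```

In a real client the caller would map RESET_CONNECT_BACKOFF to the channel's resetConnectBackoff call and REMOVE_CHANNEL to shutting the channel down and evicting it from whatever pool holds it.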

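Testing showed two things that could be improved. First, after the host is changed, it takes time for jraft to notice (this TTL can be set by the user). Second, for a dead IP, because no kernel is listening on that IP (so no RST can be returned immediately), the client only gives up connecting and returns a failure once it times out (this depends on the rpcTimeout configured in the program).
So the whole process is not fast, but it should be fine for the scenario in #683: it is enough that the leader eventually connects to the follower whose IP changed. However, this kind of change cannot be applied to several nodes in quick succession; otherwise the leader will certainly step down because a majority is down.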
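The comment does not say which TTL is meant. Assuming it is the JVM's DNS cache TTL, one way to shorten it is the standard `networkaddress.cache.ttl` security property, as in the sketch below. `DnsCacheTtlExample` and `setDnsCacheTtlSeconds` are hypothetical names, and the value 10 is arbitrary.

```java
import java.security.Security;

/** Hypothetical helper: shortens the JVM's cache lifetime for successful DNS lookups. */
final class DnsCacheTtlExample {
    /**
     * Sets the JVM-wide TTL, in seconds, for successful name lookups and returns
     * the value now stored. Set it early at startup, before the first lookup,
     * because the JVM may read the policy only once.
     */
    static String setDnsCacheTtlSeconds(int seconds) {
        Security.setProperty("networkaddress.cache.ttl", Integer.toString(seconds));
        return Security.getProperty("networkaddress.cache.ttl");
    }

    public static void main(String[] args) {
        System.out.println("networkaddress.cache.ttl=" + setDnsCacheTtlSeconds(10));
    }
}
```

A shorter TTL only reduces the delay before a changed host name is seen; the connect timeout against the dead IP described above still applies.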

@fengjiachun fengjiachun changed the title [WIP]fix: grpc conn refresh fix: grpc conn refresh Nov 23, 2021
@killme2008 killme2008 merged commit 613fdde into master Nov 23, 2021
@killme2008 killme2008 deleted the fix_grpc_conn_refresh branch November 23, 2021 07:09
@fengjiachun fengjiachun mentioned this pull request Dec 3, 2021
10 tasks
Development

Successfully merging this pull request may close these issues.

Leader election after node restart