
Missing flow control appears to block traffic #180

Open
marwinski opened this issue Mar 3, 2021 · 17 comments
Labels
lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@marwinski

We have integrated the apiserver-network-proxy into our project and stumbled across some possibly more fundamental issues with it. Don’t get me wrong: we really like this project and appreciate the work that has been done. However, these issues would ultimately prevent us from using it, although our understanding of the whole domain might be too limited. This is what we know to be issues, or suspect based on our observations:

  1. My colleague @ScheererJ has filed PR #179 (Add a timeout in when the proxy agent dials to the remote), fixing an issue that caused our tests to fail. The problem was that a blocked Dial call caused the receive loop to hang for up to 10 minutes, blocking all tunnel traffic. Making the dial asynchronous fixed the tests. We consider this a valid fix for the issue (a minimal sketch of the general idea follows this list).

  2. Even with (1), we see that the runtime of the tests is twice that of our traditional connectivity solution (which has its own issues, which is why we want to replace it). We have no proof, but we strongly suspect the root cause is in https://github.com/kubernetes-sigs/apiserver-network-proxy/blob/master/pkg/agent/client.go#L417
    (plus a duplicate of this in the server). The problem is that if the receiver (the process to which the agent forwards the traffic) is slow to process data for whatever reason, it will block all connectivity inside the tunnel, just like with (1) above.
    The only way to fix this, from our point of view, is to introduce some kind of flow control that blocks the sender (the kube-apiserver) until the other side has received the data (or vice versa). This is of course not trivial, and we have no fix or proposal for this one so far.

  3. The konnectivity agent does not handle the case where its connection to a konnectivity server is broken, possibly due to a restart of the konnectivity server. In this case it keeps consuming resources (threads, possibly CPU, and it writes loads of logs) without ever stopping. We consider this minor, but we have no fix for this so far either.

  4. Especially with “kubectl exec” we have seen very poor throughput (comparable to a 56k modem). We have not done a deep dive, but we suspect that lots and lots of tiny gRPC messages are being sent, causing the poor throughput.
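
For illustration, here is a minimal, self-contained sketch of the general idea behind the fix for (1): run the dial off the receive loop and bound it with a timeout, so an unreachable destination cannot stall the tunnel. This is not code from apiserver-network-proxy; handleDial and its callback are hypothetical names.

```go
// Sketch only: perform the dial asynchronously and with a timeout so the
// loop that handles all tunnel traffic is never blocked by a slow dial.
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// handleDial starts the dial in its own goroutine and reports the result via
// a callback, so the caller's receive loop keeps serving other connections.
func handleDial(address string, report func(net.Conn, error)) {
	go func() {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		defer cancel()

		var d net.Dialer
		conn, err := d.DialContext(ctx, "tcp", address)
		report(conn, err)
	}()
}

func main() {
	done := make(chan struct{})

	// 192.0.2.1 (TEST-NET-1) is not routable, so this dial fails after the
	// 5s timeout instead of hanging for many minutes.
	handleDial("192.0.2.1:80", func(conn net.Conn, err error) {
		fmt.Println("dial finished:", err)
		close(done)
	})

	fmt.Println("receive loop keeps running while the dial is in flight")
	<-done
}
```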

We do see that the apiserver-network-proxy is used in production, but especially with issues (1) and (2) above we somewhat lack the imagination for how this could work (keep my disclaimer about our limited know-how in mind). Anyway, we would really like to use the network proxy, as we have already done all the integration work, and we are willing to help here. Could you share some insights, especially with regard to (2) above, and how we could help?

Thanks

@marwinski
Author

We have confirmed that issue (2) above (missing flow control) is indeed an issue, although the operating system buffers appear to be slightly bigger than anticipated. We have worked on a resolution in our fork for review: mandelsoft@bb4a77d

Without the patch we can reproduce a blocked tunnel; with the patch applied it no longer blocks.
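
To make the failure mode behind (2) easier to picture, here is a small, self-contained sketch (again, not code from this repository) of a shared forwarding loop serving one fast and one slow connection. With a plain blocking send, the slow connection's backlog would stall traffic for every connection behind the same loop; a bounded per-connection queue gives the loop a place to detect the backlog and apply backpressure only to that connection's sender, which is the kind of flow control we have in mind.

```go
// Sketch only (not apiserver-network-proxy code): why one slow receiver can
// block a whole tunnel, and where per-connection flow control would hook in.
package main

import (
	"fmt"
	"time"
)

type packet struct {
	connID  int
	payload string
}

func main() {
	// A bounded queue per connection; sizes are arbitrary for the demo.
	queues := map[int]chan packet{
		1: make(chan packet, 4), // slow receiver
		2: make(chan packet, 4), // fast receiver
	}

	go func() { // fast receiver drains immediately
		for p := range queues[2] {
			fmt.Println("conn 2 received", p.payload)
		}
	}()
	go func() { // slow receiver takes 200ms per packet
		for p := range queues[1] {
			time.Sleep(200 * time.Millisecond)
			fmt.Println("conn 1 received", p.payload)
		}
	}()

	// Shared forwarding loop. With a plain blocking send (`q <- p`), the
	// slow connection's full queue would also stop packets destined for the
	// fast connection. The select/default below marks the point where a
	// real implementation would instead signal backpressure to the sender
	// of that one connection.
	for i := 0; i < 10; i++ {
		for id, q := range queues {
			p := packet{connID: id, payload: fmt.Sprintf("msg %d", i)}
			select {
			case q <- p:
			default:
				fmt.Printf("conn %d queue full: signal backpressure, keep other connections flowing\n", id)
			}
		}
	}
	time.Sleep(3 * time.Second) // let the slow receiver catch up before exit
}
```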

@ydp

ydp commented Apr 29, 2021

I think we are encountering an issue similar to (4): in our scale tests, some cases always time out on the kubectl side, but the actual workloads run fine.
Hi @marwinski, have you ever encountered a kubectl logs issue like #167? I am searching for the possible cause, but no luck so far.

@marwinski
Author

Hi @ydp, this issue could well cause frequent timeouts from kubectl logs, since that traffic goes via the tunnel. #167 looks like the error handling issue described in (3) above.

You should be able to get rid of the blocking issues with our two changes, mandelsoft@bb4a77d and #179. Keep in mind that we have not done detailed testing of those, except to validate that they indeed fix some or even all of the blocking issues. Especially the first fix is quite involved; the second one is much simpler and will probably fix your issue.

This does not fix #167, however, although that should be quite straightforward to fix (add a bit of error handling).
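
For what it is worth, the "bit of error handling" for (3) could look roughly like the sketch below. This is illustrative only and not agent code from this repository; connectToServer is a hypothetical stand-in for the real dial and stream setup.

```go
// Sketch only: reconnect to the proxy server with capped exponential backoff
// instead of spinning and logging indefinitely when the connection breaks.
package main

import (
	"errors"
	"fmt"
	"time"
)

var attempts int

// connectToServer pretends the server is down for the first two attempts.
func connectToServer(addr string) error {
	attempts++
	if attempts < 3 {
		return errors.New("connection refused")
	}
	return nil
}

func runWithReconnect(addr string) {
	backoff := time.Second
	const maxBackoff = 30 * time.Second

	for {
		if err := connectToServer(addr); err != nil {
			fmt.Printf("connection to %s broken (%v); retrying in %s\n", addr, err, backoff)
			time.Sleep(backoff)
			if backoff *= 2; backoff > maxBackoff {
				backoff = maxBackoff
			}
			continue
		}
		fmt.Println("connected; serving tunnel traffic")
		return // a real agent would serve the stream here and loop again on failure
	}
}

func main() {
	runWithReconnect("konnectivity-server:8091") // hypothetical address
}
```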

@rtheis

rtheis commented Jun 1, 2021

Hi folks, we've been doing a lot of Kubernetes conformance testing with Konnectivity version 0.0.19 enabled for Kubernetes version 1.21 clusters.

Here is the basic structure of our test invocation.

kubetest --provider=skeleton --ginkgo-parallel="${k8s_conformance_e2e_num_parallel_test_nodes}" --test --test_args="--ginkgo.noColor --ginkgo.flakeAttempts=2 --ginkgo.focus=\[Conformance\] --ginkgo.skip=\[Serial\]|\[Disruptive\] ${EXTRA_TEST_ARGS}" --check-version-skew=false --dump="${k8s_conformance_e2e_results_directory}"

These tests have consistently failed with Konnectivity. Scaling up k8s_conformance_e2e_num_parallel_test_nodes only makes the problem worse. When we replace Konnectivity with our previous proxy solution, the tests consistently pass with k8s_conformance_e2e_num_parallel_test_nodes=25.

We then built Konnectivity with #179 and now the tests pass with k8s_conformance_e2e_num_parallel_test_nodes=25. Hopefully this data will help expedite the review and merge of #179 and a new release of Konnectivity with this critical fix. Thanks.

@mihivagyok
Contributor

@marwinski
Hi! Could you please share some testing details on how to reproduce issue (2) (blocking the tunnel)?

Thank you!
Adam

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Sep 1, 2021
@rtheis

rtheis commented Sep 1, 2021

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label on Sep 1, 2021
@k8s-triage-robot
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Nov 30, 2021
@rtheis

rtheis commented Nov 30, 2021

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label on Nov 30, 2021
@relyt0925
Contributor

/remove-lifecycle stale

@k8s-triage-robot
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Apr 30, 2022
@rtheis

rtheis commented May 2, 2022

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label on May 2, 2022
@k8s-triage-robot
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Jul 31, 2022
@rtheis

rtheis commented Aug 1, 2022

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label on Aug 1, 2022
@k8s-triage-robot
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Oct 30, 2022
@rtheis

rtheis commented Oct 30, 2022

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label on Oct 30, 2022
@tallclair
Contributor

/lifecycle frozen

@k8s-ci-robot added the lifecycle/frozen label on Nov 11, 2022