[veda-binder] Update the underlying infra #4595

Merged on Aug 9, 2024 (14 commits)

GeorgianaElena
Member

This is for #4576

github-actions bot commented Aug 8, 2024

Merging this PR will trigger the following deployment actions.

Support and Staging deployments

| Cloud Provider | Cluster Name | Upgrade Support? | Reason for Support Redeploy | Upgrade Staging? | Reason for Staging Redeploy |
| --- | --- | --- | --- | --- | --- |
| aws | nasa-veda | No | | Yes | Following helm chart values files were modified: staging.values.yaml |

Production deployments

| Cloud Provider | Cluster Name | Hub Name | Reason for Redeploy |
| --- | --- | --- | --- |
| aws | nasa-veda | prod | Following helm chart values files were modified: prod.values.yaml |
| aws | nasa-veda | binder | Following helm chart values files were modified: binder.values.yaml |

@sgibson91
Member

sgibson91 commented Aug 8, 2024

I may have broken this now, as the build pod is no longer able to connect to the docker socket:

Events:
  Type     Reason       Age                From               Message
  ----     ------       ----               ----               -------
  Normal   Scheduled    79s                default-scheduler  Successfully assigned binder/build-binder-2dexamples-2drequirements-55ab5c-505-2c to ip-192-168-30-60.us-west-2.compute.internal
  Warning  FailedMount  15s (x8 over 79s)  kubelet            MountVolume.SetUp failed for volume "docker-socket" : hostPath type check failed: /var/run/binder-binder/docker-api/docker-api.sock is not a socket file
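
The FailedMount event says the kubelet expected a UNIX socket at the hostPath and did not find one. A minimal sketch of a check that could be run on the node itself to see what actually sits at that path; the helper name `describe_path` is hypothetical, and this only mirrors the idea of the type check rather than the kubelet's own code:

```python
import os
import stat


def describe_path(path: str) -> str:
    """Report what kind of filesystem object exists at `path`.

    Hypothetical diagnostic helper: returns "socket" only when a UNIX
    socket exists at the path.
    """
    try:
        mode = os.stat(path).st_mode
    except FileNotFoundError:
        return "missing"
    if stat.S_ISSOCK(mode):
        return "socket"
    if stat.S_ISDIR(mode):
        return "directory"
    if stat.S_ISREG(mode):
        return "file"
    return "other"


if __name__ == "__main__":
    # Path copied from the event above; run this on the affected node.
    print(describe_path("/var/run/binder-binder/docker-api/docker-api.sock"))
```

Anything other than "socket" on the node where the build pod landed would be consistent with the docker API not serving on that node.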

The taints/tolerations are also failing: ip-192-168-30-60.us-west-2.compute.internal is in the nb-r5-xlarge nodegroup, whereas the build should be scaling up the nb-binder-r5-xlarge nodegroup. I also gave ip-192-168-30-60.us-west-2.compute.internal an extra taint so that it would drain, following https://infrastructure.2i2c.org/howto/upgrade-cluster/aws/#performing-rolling-upgrades-using-drain-or-not, so I'm not sure why the pod is being scheduled there.
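
For reference, a rough sketch of the Kubernetes rule for matching a toleration against a taint. It may help when reasoning about the scheduling above: a toleration only permits a pod to land on a tainted node, it does not steer the pod toward that node. The taint key used in the usage section is a made-up placeholder, not the one configured on this cluster.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str  # "NoSchedule", "PreferNoSchedule" or "NoExecute"
    value: str = ""


@dataclass(frozen=True)
class Toleration:
    key: str = ""  # empty key combined with "Exists" matches any key
    operator: str = "Equal"  # "Equal" or "Exists"
    value: str = ""
    effect: str = ""  # empty effect matches any effect


def tolerates(tol: Toleration, taint: Taint) -> bool:
    """Return True when the toleration matches the taint."""
    if tol.effect and tol.effect != taint.effect:
        return False
    if tol.key and tol.key != taint.key:
        return False
    if tol.operator == "Exists":
        return True
    return tol.value == taint.value


def untolerated_taints(taints: List[Taint], tolerations: List[Toleration]) -> List[Taint]:
    """Taints that would keep a pod off the node (hard effects only)."""
    hard = ("NoSchedule", "NoExecute")
    return [
        t for t in taints
        if t.effect in hard and not any(tolerates(tol, t) for tol in tolerations)
    ]


if __name__ == "__main__":
    # Placeholder taint key, for illustration only.
    taint = Taint(key="example/dedicated", value="binder", effect="NoSchedule")
    print(untolerated_taints([taint], []))  # pod without tolerations is kept off
    print(untolerated_taints([], []))       # untainted node accepts any pod
```

Under this rule, a node with no hard taints accepts any pod, whether or not the pod carries tolerations meant for another nodegroup.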

@sgibson91
Member

My problems above seem to have been caused by trying to add the 2i2c:node-purpose tag. I don't know why, because tags should only apply at the AWS level, not the k8s level. But I reverted my attempts and the binder works again now, so I'm going to merge this as-is.

@sgibson91
Member

sgibson91 commented Aug 9, 2024

Currently build pods are not being scheduled on the binder nodegroup. I think the node selector and toleration need to be added to the section below, but I don't know what the appropriate key is. I've tried extraTolerations, extra_tolerations, and tolerations - none of which worked.

https://github.com/GeorgianaElena/pilot-hubs/blob/263fc911b6ff5da4ed3cd22c244d2e00304248de/config/clusters/nasa-veda/binder.values.yaml#L111-L115


I wonder if this is related to the build pods receiving this error?

Events:
  Type     Reason       Age                From               Message
  ----     ------       ----               ----               -------
  Normal   Scheduled    50s                default-scheduler  Successfully assigned binder/build-binder-2dexamples-2dr-974603-02c-0e to ip-192-168-31-239.us-west-2.compute.internal
  Warning  FailedMount  18s (x7 over 50s)  kubelet            MountVolume.SetUp failed for volume "docker-socket" : hostPath type check failed: /var/run/binder-binder/docker-api/docker-api.sock is not a socket file

Is the docker API unavailable for new builds because the build pods are not getting scheduled to the binder nodegroup?

@sgibson91
Member

I ended up replicating #4482 entirely and putting each hub on its own nodegroup with a "hub-name" label. This means we no longer need to use taints/tolerations: the build pods get scheduled into the binder nodegroup and are able to connect to the docker socket.
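
A small sketch of why a per-hub label is enough on its own: a pod's node selector is satisfied only by nodes whose labels contain every key/value pair in the selector. The exact label key and values below are assumptions based on the "hub-name" wording above, and the node-to-label mapping is illustrative rather than copied from the cluster config.

```python
from typing import Dict, List


def matches_node_selector(node_labels: Dict[str, str], node_selector: Dict[str, str]) -> bool:
    """True when every selector pair is present in the node's labels."""
    return all(node_labels.get(k) == v for k, v in node_selector.items())


def eligible_nodes(nodes: Dict[str, Dict[str, str]], node_selector: Dict[str, str]) -> List[str]:
    """Names of the nodes a pod with this selector could be scheduled on."""
    return sorted(
        name for name, labels in nodes.items()
        if matches_node_selector(labels, node_selector)
    )


if __name__ == "__main__":
    # Illustrative labels: one nodegroup per hub, keyed by an assumed "hub-name" label.
    nodes = {
        "nb-r5-xlarge": {"hub-name": "prod"},
        "nb-binder-r5-xlarge": {"hub-name": "binder"},
    }
    # A build pod selecting the binder hub can only land on the binder nodegroup.
    print(eligible_nodes(nodes, {"hub-name": "binder"}))
```

Unlike a toleration, the selector actively excludes every node that lacks the label, which is what keeps build pods next to the docker API on the binder nodes.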

I also added the "node-purpose" tags while I was at it.

@sgibson91 sgibson91 merged commit 726a441 into 2i2c-org:main Aug 9, 2024
9 checks passed
github-actions bot commented Aug 9, 2024

🎉🎉🎉🎉

Monitor the deployment of the hubs here 👉 https://github.com/2i2c-org/infrastructure/actions/runs/10319180463

@GeorgianaElena GeorgianaElena deleted the binder.nasa-veda.2i2c.cloud branch August 26, 2024 19:32