AWS EKS doesn't update the cluster automatically.
Subscribe to the Amazon Linux AMI Security Bulletin.
- Check if a new EKS AMI is available after an ALAS2 alert
- If needed, increase the worker count via builder (unless we have autoscaling)
- Manually drain and kill each node that uses the old AMI
- Check in the EC2 console that the workers are using the new AMI
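The AMI checks above can also be done from the command line. The following is a minimal sketch, not part of builder: the function names are made up for illustration, `1.22` is only an example version, and it assumes the workers carry the usual `kubernetes.io/cluster/<cluster-name>` tag and that the AWS CLI is already configured for the right account and region.

```shell
# Hypothetical helper: name of the public SSM parameter under which AWS
# publishes the recommended EKS-optimized Amazon Linux 2 AMI.
# $1: Kubernetes minor version, e.g. 1.22
eks_ami_parameter() {
    printf '/aws/service/eks/optimized-ami/%s/amazon-linux-2/recommended/image_id\n' "$1"
}

# Print the currently recommended AMI ID for that Kubernetes version.
latest_eks_ami() {
    aws ssm get-parameter \
        --name "$(eks_ami_parameter "$1")" \
        --query 'Parameter.Value' --output text
}

# Print "instance-id ami-id" for every running instance of the cluster.
# $1: cluster name (assumption: workers are tagged kubernetes.io/cluster/<name>)
worker_amis() {
    aws ec2 describe-instances \
        --filters "Name=tag-key,Values=kubernetes.io/cluster/$1" \
                  "Name=instance-state-name,Values=running" \
        --query 'Reservations[].Instances[].[InstanceId,ImageId]' \
        --output text
}
```

Comparing the output of `latest_eks_ami 1.22` with the second column of `worker_amis <cluster-name>` shows which workers still run an older AMI.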
To manually drain and kill the nodes:
kubectl get nodes # list the nodes in the cluster
kubectl cordon my-node # no new Pods will be scheduled here
kubectl drain --ignore-daemonsets my-node # existing Pods will be evicted and sent to another node
aws ec2 terminate-instances --instance-ids=... # terminate a node, a new one will be created
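The same sequence can be wrapped in a small function so that each node is handled identically. This is only a sketch: `replace_node` and `instance_id_from_provider_id` are hypothetical names, and it assumes the node's `spec.providerID` has the usual AWS form `aws:///<availability-zone>/<instance-id>`.

```shell
# Extract the EC2 instance ID from a node's providerID,
# e.g. aws:///us-east-1a/i-0123456789abcdef0 -> i-0123456789abcdef0
instance_id_from_provider_id() {
    printf '%s\n' "${1##*/}"
}

# Cordon, drain and terminate a single worker node.
# $1: node name as shown by `kubectl get nodes`
replace_node() {
    node="$1"
    # Look up the instance ID first, while the node object still exists.
    provider_id="$(kubectl get node "$node" -o jsonpath='{.spec.providerID}')" || return 1
    instance_id="$(instance_id_from_provider_id "$provider_id")"

    kubectl cordon "$node" || return 1                     # no new Pods will be scheduled here
    kubectl drain --ignore-daemonsets "$node" || return 1  # stop here if the drain complains
    aws ec2 terminate-instances --instance-ids "$instance_id"
}
```

Run it one node at a time, for example `replace_node my-node`, and wait for the replacement node to become `Ready` before moving on. The function stops before terminating anything if the drain fails.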
`kubectl drain` will complain if pods are using local data storage or if evicting a pod would violate a PodDisruptionBudget.
You can force the eviction using `--delete-local-data` (renamed `--delete-emptydir-data` in kubectl 1.20 and later) and `--disable-eviction` respectively.
Check which pods are complaining before doing this and make sure that this wouldn't break production services.
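Before forcing anything, it helps to see what is running on the node and which PodDisruptionBudgets exist. A minimal sketch with hypothetical function names, using only standard `kubectl get` options:

```shell
# List every pod currently scheduled on a node, across all namespaces.
# $1: node name
pods_on_node() {
    kubectl get pods --all-namespaces -o wide --field-selector "spec.nodeName=$1"
}

# List all PodDisruptionBudgets, including how many disruptions each allows.
list_pdbs() {
    kubectl get poddisruptionbudgets --all-namespaces
}
```

A budget that currently allows zero disruptions and covers a pod on the node is the likely reason for a blocked drain.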
Copied from builder docs.
To upgrade the Kubernetes version of the cluster:
- check aws docs for availability and notes
- use silver-surfer/kubedd to check for api deprecations: `kubedd --target-kubernetes-version=1.22` (example for 1.22 upgrade). `DEPRECATED` is okay, but if an api is `DELETED` in the new k8s version you will have to fix the affected charts.
- bump k8s version (one minor at a time) in `elife.yaml`
- apply using `builder/bldr update_infrastructure:kubernetes-aws--flux-prod`. This should change the EKS (i.e. the k8s control plane) and AutoscalingGroup AMI image.
- if `flux` fails to access the api after the EKS upgrade, try restarting it with `kubectl -n flux rollout restart deployment flux`
- upgrade `kube-proxy` (see aws docs)
- drain and terminate node by node as described above to upgrade the workers
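A few checks around the steps above can be scripted. The sketch below is illustrative only: the function names are hypothetical, versions are expected in `MAJOR.MINOR` form (as in the `1.22` example), and the `kube-proxy` query assumes the default DaemonSet in the `kube-system` namespace.

```shell
# Succeeds only if the target is exactly one minor version above the current one.
# $1: current version, $2: target version, both as MAJOR.MINOR (e.g. 1.21 1.22)
is_single_minor_step() {
    cur_major="${1%%.*}"; cur_minor="${1#*.}"
    new_major="${2%%.*}"; new_minor="${2#*.}"
    [ "$cur_major" = "$new_major" ] && [ "$new_minor" -eq $((cur_minor + 1)) ]
}

# Control plane version as reported by EKS.
# $1: cluster name
cluster_version() {
    aws eks describe-cluster --name "$1" --query 'cluster.version' --output text
}

# Image currently used by the kube-proxy DaemonSet.
kube_proxy_image() {
    kubectl -n kube-system get daemonset kube-proxy \
        -o jsonpath='{.spec.template.spec.containers[0].image}'
}
```

For example, `is_single_minor_step "$(cluster_version <cluster-name>)" 1.22` guards against skipping a minor version, and `kube_proxy_image` shows whether `kube-proxy` still needs upgrading after the control plane has changed.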
Changing api versions in the chart can lead to helm complaining about `existing resource conflict`.
This appears to be an issue with helm3 that helm-operator is aware of but can't fix until helm3 fixes it upstream.
To fix: delete the resource (e.g. Deployment, DaemonSet, StatefulSet) with `kubectl`. It should automatically be replaced by the new version. This will cause brief downtime.
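As a rough sketch of that fix, assuming placeholder values for the namespace, kind and name (none of them come from this document), the deletion and a follow-up check could look like this:

```shell
# Delete one conflicting resource so it can be recreated from the updated chart.
# $1: namespace, $2: kind (e.g. deployment), $3: resource name (all placeholders)
delete_conflicting_resource() {
    kubectl -n "$1" delete "$2" "$3"
}

# Show the resource again once the new version has been applied.
show_resource() {
    kubectl -n "$1" get "$2" "$3"
}
```

Delete one resource at a time, for example `delete_conflicting_resource my-namespace deployment my-app`, and confirm with `show_resource` that it has come back before touching the next one, since each deletion causes brief downtime.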
- https://docs.aws.amazon.com/eks/latest/userguide/update-cluster.html
- https://github.com/elifesciences/builder/blob/master/docs/eks.md#ami-update