diff --git a/runbooks/source/access-eks-cluster.html.md.erb b/runbooks/source/access-eks-cluster.html.md.erb index 0e0e72eb..65d3ed0f 100644 --- a/runbooks/source/access-eks-cluster.html.md.erb +++ b/runbooks/source/access-eks-cluster.html.md.erb @@ -1,7 +1,7 @@ --- title: Access EKS Cluster weight: 8600 -last_reviewed_on: 2024-07-10 +last_reviewed_on: 2024-10-16 review_in: 3 months --- diff --git a/runbooks/source/change-alias-in-route53.html.md.erb b/runbooks/source/change-alias-in-route53.html.md.erb index fc2c8505..79013f21 100644 --- a/runbooks/source/change-alias-in-route53.html.md.erb +++ b/runbooks/source/change-alias-in-route53.html.md.erb @@ -1,19 +1,19 @@ --- title: Change load balancer alias to the interface IP's in Route53. weight: 358 -last_reviewed_on: 2024-07-10 +last_reviewed_on: 2024-10-16 review_in: 3 months --- # <%= current_page.data.title %> -This run book is a recovery action to mitigate slow performance of ingress traffic [incident][performance incident] when an interface fails in an availability zone (AZ), clients time out when they attempt to connect to one of unhealthy NLB EIPs +This runbook is a recovery action to mitigate the slow ingress traffic performance [incident][performance incident]: when an interface fails in an availability zone (AZ), clients time out when they attempt to connect to one of the unhealthy NLB EIPs. ## Request AWS to restart the health check AWS confirmed the root cause of the [incident][performance incident] as, “the health checking subsystem did not correctly detect some of your targets as unhealthy, which resulted in clients timing out when they attempted to connect to one of your NLB EIPs". -AWS mitigated the impact by restarting the health checking service, which caused the target health to be updated appropriately. Cloud-platform team don't have access to restart the health check service, request AWS to restart it for us. 
+AWS mitigated the impact by restarting the health checking service, which caused the target health to be updated appropriately. The cloud-platform team doesn't have access to restart the health check service; we request AWS to restart it for us. If restarting does not resolve the issue, look at changing the load balancer alias. @@ -47,9 +47,9 @@ a76b4f2b1811e4f7589eaca69c4a46c5-b700f2aa70780ce3.elb.eu-west-2.amazonaws.com ha 2) Find the unhealthy NLB EIP -Now we have all of the information we need to make a cURL call over to the external load balancer EIPs. +Now, we have all of the information we need to make a cURL call over to the external load balancer EIPs. -Run this on 3 EIPs of the NLB. If everything is working correctly it would return OK. If it return "Timeout", then it is most likely an unhealthy external load balancer EIP. +Run this on 3 EIPs of the NLB. If everything works correctly, it will return OK. If it returns "Timeout", then it is most likely an unhealthy external load balancer EIP. ``` while :; do (curl -o/dev/null -m1 -k -H 'Host: login.yy-0208-0000.cloud-platform.service.justice.gov.uk' https://35.179.65.116 2>/dev/null && echo "OK") || echo "Timeout" ; sleep 1 ; done @@ -67,11 +67,11 @@ _external_dns.login.yy-0208-0000.cloud-platform.service.justice.gov.uk TXT Weigh "heritage=external-dns,external-dns/owner=yy-0208-0000,external-dns/resource=ingress/kuberos/kuberos" ``` -Edit the route53 TXT record and update the owner, set the incorrect owner field, so external-dns couldn't revert the information in the A record. +Edit the Route53 TXT record and set the owner field to an incorrect value, so external-dns can't revert the information in the A record. ``` "heritage=external-dns,external-dns/owner=yy-CCCC-BBBB,external-dns/resource=ingress/kuberos/kuberos" ``` -Edit the "A" record and uncheck the alias option, add 2 healthy IP's in the value filed and save the record. Repeat this on all the hosts using the affected NLB. 
+Edit the "A" record and uncheck the alias option, add 2 healthy IP's in the value field and save the record. Repeat this on all the hosts using the affected NLB. [performance incident]: https://runbooks.cloud-platform.service.justice.gov.uk/incident-log.html#q3-2022-july-september diff --git a/runbooks/source/creating-a-live-like.html.md.erb b/runbooks/source/creating-a-live-like.html.md.erb index e720af76..ad695791 100644 --- a/runbooks/source/creating-a-live-like.html.md.erb +++ b/runbooks/source/creating-a-live-like.html.md.erb @@ -1,7 +1,7 @@ --- title: Creating a live-like Cluster weight: 350 -last_reviewed_on: 2024-04-10 +last_reviewed_on: 2024-10-16 review_in: 6 months --- @@ -38,7 +38,7 @@ to the configuration similar to the live cluster. 2. Add the `starter_pack_count = 40` variable to the starter_pack module -> Sometimes terraform will error out with an unclear error message this is usually due to a low default `ulimit` to fix this you can set `ulimit -n 2048` +> Sometimes terraform will error out with an unclear error message. This is usually due to a low default `ulimit`. To fix this, you can set `ulimit -n 2048` 3. Run `terraform plan` and confirm that your changes are correct 4. Run `terraform apply` to apply the changes to your test cluster @@ -52,7 +52,7 @@ See documentation for upgrading a [cluster](upgrade-eks-cluster.html). * Setup pingdom alerts for starter-pack helloworld and multi-container app -> When nodes recycle it's possible that the multi-container app will break giving false positives. +> When nodes recycle, it's possible that the multi-container app will break giving false positives. 
* Useful command liners * `watch -n 1 "kubectl get events"` - get all Kubernetes events diff --git a/runbooks/source/custom-default-backend.html.md.erb b/runbooks/source/custom-default-backend.html.md.erb index 143bd7ed..0c8c2f0f 100644 --- a/runbooks/source/custom-default-backend.html.md.erb +++ b/runbooks/source/custom-default-backend.html.md.erb @@ -1,6 +1,6 @@ --- title: Custom default-backend -last_reviewed_on: 2024-07-10 +last_reviewed_on: 2024-10-16 weight: 9000 review_in: 3 months --- @@ -16,12 +16,12 @@ However, some applications don’t want to use the cloud-platform custom default ## Creating your own custom error page ### 1. Create your docker image -First create a docker image containing custom HTTP error pages using the [example][ingress-nginx-custom-error-pages] from the ingress-nginx, or [simplified version][cloud-platform-custom-error-pages] created by the cloud platform team. +First, create a docker image containing custom HTTP error pages using the [example][ingress-nginx-custom-error-pages] from ingress-nginx, or the [simplified version][cloud-platform-custom-error-pages] created by the cloud platform team. ### 2. Creating a service and deployment Using this [custom-default-backend][customized-default-backend] example from ingress-nginx, create a service and deployment of the error pages container in your namespace. -To create Deployment and Service manually use this below command: +To create the Deployment and Service manually, use the command below: ``` $ kubectl -n ${namespace} create -f custom-default-backend.yaml @@ -80,11 +80,11 @@ spec: port: number: 4567 ``` -> Note - Please change the `ingress-name` and `environment-name` values in the above example, you can get the `environment-name` value from your namespace label "cloud-platform.justice.gov.uk/environment-name". The `colour` should be `green` for ingress in EKS `live` cluster +> Note - Please change the `ingress-name` and `environment-name` values in the above example. 
You can get the `environment-name` value from your namespace label "cloud-platform.justice.gov.uk/environment-name". The `colour` should be `green` for ingress in the EKS `live` cluster. ## Use the platform-level error page -Some teams want their application to serve their own error page for example 404s, but want to serve cloud platforms custom error page from ingress controller default backend for other error codes like 502,503 and 504, this can be done by using [custom-http-errors][custom-http-error-annotation] annotation in your ingress for error codes teams want to serve the cloud platforms custom error page. +Some teams want their application to serve their own error page, for example 404s, but want the cloud platform's custom error page served from the ingress controller's default backend for other error codes such as 502, 503 and 504. This can be done by using the [custom-http-errors][custom-http-error-annotation] annotation in your ingress, listing the error codes for which the cloud platform's custom error page should be served. Example Ingress file to use platform-level error page for custom-http-errors: "502,503,504". All other errors except `502,503,504` will be served from the application error page. 
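To illustrate the annotation described above, here is a minimal sketch of an Ingress that hands only 502, 503 and 504 to the platform default backend. All metadata names and the host are hypothetical placeholders, not values from these runbooks, and the sketch omits cloud-platform-specific annotations:

```yaml
# Hypothetical example: name, namespace and host are placeholders.
# custom-http-errors lists the status codes intercepted by the
# ingress controller and served from its default backend.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app-ingress
  namespace: my-namespace
  annotations:
    nginx.ingress.kubernetes.io/custom-http-errors: "502,503,504"
spec:
  ingressClassName: nginx
  rules:
    - host: my-app.example.cloud-platform.service.justice.gov.uk
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-app-service
                port:
                  number: 8080
```

With this in place, a 404 from the application passes through unchanged, while a 502, 503 or 504 is intercepted and served by the default backend.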
diff --git a/runbooks/source/destroy-concourse-build-data.html.md.erb b/runbooks/source/destroy-concourse-build-data.html.md.erb index 798341ab..8a061f8e 100644 --- a/runbooks/source/destroy-concourse-build-data.html.md.erb +++ b/runbooks/source/destroy-concourse-build-data.html.md.erb @@ -1,7 +1,7 @@ --- title: Destroy Concourse Build Data weight: 9000 -last_reviewed_on: 2024-07-10 +last_reviewed_on: 2024-10-16 review_in: 3 months --- diff --git a/runbooks/source/divergence-error.html.md.erb b/runbooks/source/divergence-error.html.md.erb index 504ccf07..64b5506c 100644 --- a/runbooks/source/divergence-error.html.md.erb +++ b/runbooks/source/divergence-error.html.md.erb @@ -1,7 +1,7 @@ --- title: How to Investigate Divergence Errors weight: 210 -last_reviewed_on: 2024-07-10 +last_reviewed_on: 2024-10-16 review_in: 3 months --- diff --git a/runbooks/source/expand.html.md.erb b/runbooks/source/expand.html.md.erb index 3e5bccc0..86b1f632 100644 --- a/runbooks/source/expand.html.md.erb +++ b/runbooks/source/expand.html.md.erb @@ -1,7 +1,7 @@ --- title: Expanding Persistent Volumes created using StatefulSets weight: 600 -last_reviewed_on: 2024-07-10 +last_reviewed_on: 2024-10-16 review_in: 3 months --- diff --git a/runbooks/source/helm-repository.html.md.erb b/runbooks/source/helm-repository.html.md.erb index 5eca39db..7133860b 100644 --- a/runbooks/source/helm-repository.html.md.erb +++ b/runbooks/source/helm-repository.html.md.erb @@ -1,7 +1,7 @@ --- title: Helm Charts Repository weight: 710 -last_reviewed_on: 2024-07-10 +last_reviewed_on: 2024-10-16 review_in: 3 months --- diff --git a/runbooks/source/leavers-guide.html.md.erb b/runbooks/source/leavers-guide.html.md.erb index 556f22a0..cd8f3c1a 100644 --- a/runbooks/source/leavers-guide.html.md.erb +++ b/runbooks/source/leavers-guide.html.md.erb @@ -1,7 +1,7 @@ --- title: Leavers Guide weight: 9100 -last_reviewed_on: 2024-07-12 +last_reviewed_on: 2024-10-16 review_in: 3 months --- diff --git 
a/runbooks/source/monitor-eks-cluster.html.md.erb b/runbooks/source/monitor-eks-cluster.html.md.erb index c547374a..d17247d0 100644 --- a/runbooks/source/monitor-eks-cluster.html.md.erb +++ b/runbooks/source/monitor-eks-cluster.html.md.erb @@ -1,7 +1,7 @@ --- title: Monitor EKS Cluster weight: 70 -last_reviewed_on: 2024-04-10 +last_reviewed_on: 2024-10-16 review_in: 6 months --- diff --git a/runbooks/source/recycle-node.html.md.erb b/runbooks/source/recycle-node.html.md.erb index 4b974eaf..c07de794 100644 --- a/runbooks/source/recycle-node.html.md.erb +++ b/runbooks/source/recycle-node.html.md.erb @@ -1,13 +1,13 @@ --- title: Manually run recycle node command weight: 250 -last_reviewed_on: 2024-07-10 +last_reviewed_on: 2024-10-10 review_in: 3 months --- # Recycle-node -The [recycle-node pipeline][recyclenode-pipeline-definition] runs every day on the `live` cluster, it executes the [cloud-platform cli][recycle-node-cli] command to replace the oldest worker node by: +The [recycle-node pipeline][recyclenode-pipeline-definition] runs every day on the `live` cluster. 
It executes the [cloud-platform cli][recycle-node-cli] command to replace the oldest worker node by: * Cordoning the oldest node * Draining the node @@ -19,7 +19,7 @@ To recycle to oldest node on the cluster in your current context: cloud-platform cluster recycle-node --oldest -To recycle a given node on the cluster in your current context +To recycle a given node on the cluster in your current context: cloud-platform cluster recycle-node --name ip-XXX.XX.XX.XX.eu-west-2.compute.internal @@ -33,7 +33,7 @@ FATA[0000] node ip-172-20-53-167.eu-west-2.compute.internal is already cordoned, cloud-platform cluster recycle-node --ignore-label -The other optional flags are +The other optional flags are: ```bash --aws-access-key string aws access key to use diff --git a/runbooks/source/revoke-user-auth0-kubeconfig-access-token.html.md.erb b/runbooks/source/revoke-user-auth0-kubeconfig-access-token.html.md.erb index b30b6709..8cdede5e 100644 --- a/runbooks/source/revoke-user-auth0-kubeconfig-access-token.html.md.erb +++ b/runbooks/source/revoke-user-auth0-kubeconfig-access-token.html.md.erb @@ -1,7 +1,7 @@ --- title: Revoke auth0 kubeconfig access token weight: 275 -last_reviewed_on: 2024-07-10 +last_reviewed_on: 2024-10-16 review_in: 3 months --- @@ -13,9 +13,9 @@ Use this runbook if we make changes to the Auth0 authorisation process and requi GitHub is being used as an OIDC provider. Once you've logged in to GitHub, it provides an ID token(valid for 10 hours), which is a signed JWT containing your GitHub username and a a list of teams you're in. -To revoke the tokens you need MOJ organisation administrator access, if you are not a Github admin request some one in the team who are Github admin to do it for you. +To revoke the tokens, you need MOJ organisation administrator access. If you are not a GitHub admin, ask someone in the team who is a GitHub admin to do it for you. 
-Once you logged in as MOJ github Organization administrator, go into [settings](https://github.com/organizations/ministryofjustice/settings/profile), select developer settings and Oauth [Apps](https://github.com/organizations/ministryofjustice/settings/applications) and search for "MOJ Cloud Platforms Auth0 (prod)" +Once you are logged in as an MOJ GitHub Organization administrator, go into [settings](https://github.com/organizations/ministryofjustice/settings/profile), select Developer settings and OAuth [Apps](https://github.com/organizations/ministryofjustice/settings/applications) and search for "MOJ Cloud Platforms Auth0 (prod)" Click on the "Revoke all user tokens" button, this will force users to reauthenticate to get a new token. @@ -57,7 +57,7 @@ $ terraform apply -target=module.kuberos #### 3) Verifiying changes -In order to verify that the changes were successfully applied +In order to verify that the changes were successfully applied: - You can authenticate to the cluster (follow [user guide](https://user-guide.cloud-platform.service.justice.gov.uk/documentation/getting-started/kubectl-config.html#authentication)) diff --git a/runbooks/source/scheduled-pr-reminders.html.md.erb b/runbooks/source/scheduled-pr-reminders.html.md.erb index a2bea665..8b08362d 100644 --- a/runbooks/source/scheduled-pr-reminders.html.md.erb +++ b/runbooks/source/scheduled-pr-reminders.html.md.erb @@ -1,7 +1,7 @@ --- title: Scheduled PR Reminders weight: 9101 -last_reviewed_on: 2024-07-10 +last_reviewed_on: 2024-10-10 review_in: 3 months --- @@ -11,19 +11,19 @@ Scheduled reminders help the Cloud Platform focus on the most important review r All reminders are created on Github Team level - For the Cloud Platform team, the team is `webops` -To view all scheduled reminders for team webops; +To view all scheduled reminders for team webops: https://github.com/ministryofjustice > Teams > Webops > Settings > Scheduled Reminders -There is currently 2 reminders setup; +There are 
currently 2 reminders set up: - **cloud-platform-notify** - Report all open PRs for all cloud-platform-* repos every hour between 9am-5pm UTC Monday to Friday. + Reports all open PRs for all cloud-platform-* repos every hour between 9am-5pm UTC Monday to Friday. - **cloud-platform** - Report all open PRs for the cloud-platform-environments and cloud-platform-infrastructure repos at 9am UTC Monday to Friday. + Reports all open PRs for the cloud-platform-environments and cloud-platform-infrastructure repos at 9am UTC Monday to Friday. ### Steps required for new repositories