Test Infra Lead Handbook

Note: Currently, the test-infra lead must be a member of Google's GKE EngProd team in order to have access to the Prow cluster. This will change once we migrate our testing infrastructure to a CNCF account. (xref kubernetes/test-infra#5085)

The test-infra lead is responsible for three major areas during the release cycle:

  1. Create CI/Presubmit jobs for the new release, and populate the Testgrid dashboard

  2. Configure merge automation for code slush, freeze, and thaw

  3. Monitor test infra status, keep test infra stable, respond to test-infra-related issues, and notify the Release Lead and CI Signal Lead of issue status changes

You can work with @kubernetes/test-infra-maintainers or test infra oncall if you are blocked by anything. Also feel free to ping the #sig-testing and #testing-ops Kubernetes Slack channels to reach out for help.

Create CI/Presubmit jobs for the new release

This step should happen in week 6-7, when we create the new release branch.

Most of the release blocking jobs are named with -beta|-stableX, which are mapped to our release channels.

Note that this section reflects the process as it stands today; we are actively looking to simplify it.

  1. Bump build job branches for the k8s build jobs

  2. Create kubekins images for the new release, add a new release target in the kubekins Makefile

  3. Update the release version in the image bump script and push new kubekins images by running the script. (Note that whoever runs the script needs access to the k8s-testimages GCP project.)

  4. Similarly, make a new Dockerfile for the kubekins-test image, which is the image we use for our integration and verify jobs. Also bump the image tags in the kubernetes_verify scenario.

  5. grep for manual-release-bump-required under test-infra. Those are the jobs that need to be manually bumped each release cycle; remap them to the up-to-date branches. Similar to step 2, fork a new version of the kubernetes/kubernetes presubmit job and remove references to the older branches.

  6. Update the Testgrid config. This is currently manual work: find the dashboard tabs for release-1.x and bump them, and the jobs inside, to release-1.(x+1).

  7. Finally, update the release target section

Not all of the steps need to happen together. Some new jobs, such as bazel-build/integration/verify, require the images to be pushed before they can work properly.
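Steps 5 and 6 are mostly search-and-replace work, so a small helper can make them less error-prone. The following is a minimal sketch, not part of test-infra: the function names find_marked_lines and bump_release_refs are hypothetical, and it assumes you run it against a local checkout of test-infra and review every change by hand before committing.

```python
import os
import re

MARKER = "manual-release-bump-required"  # the marker string named in step 5


def find_marked_lines(root, marker=MARKER):
    """Return sorted (relative_path, line_number, line) tuples for lines containing marker."""
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Do not descend into git metadata; job configs never live there.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8", errors="ignore") as handle:
                    for number, line in enumerate(handle, start=1):
                        if marker in line:
                            hits.append((os.path.relpath(path, root), number, line.strip()))
            except OSError:
                continue  # unreadable file: skip it rather than abort the scan
    return sorted(hits)


def bump_release_refs(text, major, minor):
    """Rewrite release-<major>.<minor> to release-<major>.<minor+1> in text."""
    old = "release-{}.{}".format(major, minor)
    new = "release-{}.{}".format(major, minor + 1)
    # (?!\d) stops release-1.1 from matching inside release-1.11.
    return re.sub(re.escape(old) + r"(?!\d)", new, text)
```

Calling find_marked_lines(".") from the root of the checkout lists the jobs that still need a manual bump. bump_release_refs only rewrites a string, so how the result is written back to the Testgrid config is left to the caller.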

Configure merge automation for code slush, freeze, and thaw

The code slush, code freeze, and code thaw dates in the release cycle mark points at which merge requirements for PRs in the master branch and release-<current-release-number> change. The remaining branches are release-X.X branches for previous releases and are unaffected by the release cycle. Code slush and freeze are the two phases of the release cycle with additional merge requirements. Code thaw marks the switch back to the development (normal) phase.

Tide

The tool that we use to automate merges is called Tide. Its configuration lives in config.yaml. Tide identifies PRs that are mergeable using GitHub queries that correspond to the entries in the queries field. Here is an example of what the query config for kubernetes/kubernetes looks like without additional constraints related to the release cycle:

  - repos:
    - kubernetes/kubernetes
    labels:
    - lgtm
    - approved
    - "cncf-cla: yes"
    missingLabels:
    - do-not-merge
    - do-not-merge/blocked-paths
    - do-not-merge/cherry-pick-not-approved
    - do-not-merge/hold
    - do-not-merge/invalid-owners-file
    - do-not-merge/release-note-label-needed
    - do-not-merge/work-in-progress
    - needs-kind
    - needs-rebase
    - needs-sig

During code slush and freeze we use two queries instead of one for the kubernetes/kubernetes repo. One query handles the master and current release branches while the other query handles all other branches. The partition is achieved with the includedBranches and excludedBranches fields.
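To make the partition concrete, here is a simplified model of how a query selects PRs. It is an illustrative sketch only: Tide itself turns each query into a GitHub search rather than running code like this, and the Query and PullRequest classes and the matches function are hypothetical names introduced for the example. The field meanings mirror the config fields described above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Query:
    """Simplified stand-in for one entry in the queries field."""
    labels: List[str] = field(default_factory=list)
    missing_labels: List[str] = field(default_factory=list)
    included_branches: List[str] = field(default_factory=list)
    excluded_branches: List[str] = field(default_factory=list)
    milestone: Optional[str] = None


@dataclass
class PullRequest:
    """Only the PR attributes that the model needs."""
    branch: str
    labels: List[str]
    milestone: Optional[str] = None


def matches(query, pr):
    """Return True if pr satisfies every constraint in query."""
    # An empty included_branches list means "no restriction on branch".
    if query.included_branches and pr.branch not in query.included_branches:
        return False
    if pr.branch in query.excluded_branches:
        return False
    if query.milestone is not None and pr.milestone != query.milestone:
        return False
    # Every required label must be present ...
    if not set(query.labels) <= set(pr.labels):
        return False
    # ... and no blocking label may be present.
    if set(query.missing_labels) & set(pr.labels):
        return False
    return True
```

Because one query lists master and the current release branch under included_branches and the other lists the same branches under excluded_branches, every branch is covered by exactly one of the two queries.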

Code Slush

Code slush is when the merge requirements for the master and current release branches diverge from the requirements for the other branches, so this is when we split the kubernetes/kubernetes Tide query into two queries.

We only add one additional merge requirement for PRs to these two branches for code slush:

  • PRs must be in the GitHub milestone for the current release (e.g. v1.12).

Milestone requirements are configured by adding milestone: foo to a query config.

  - repos:
    - kubernetes/kubernetes
    milestone: v1.12
    includedBranches:
    - master
    - release-1.12
    labels:
    - lgtm
    - approved
    - "cncf-cla: yes"
    missingLabels:
      # as above...
  - repos:
    - kubernetes/kubernetes
    excludedBranches:
    - master
    - release-1.12
    labels:
    - lgtm
    - approved
    - "cncf-cla: yes"
    missingLabels:
      # as above...

Code Freeze

Code freeze adds one more merge requirement for PRs in the master and current release branches:

  • PRs must have the priority/critical-urgent label.

This label requirement is configured by adding priority/critical-urgent to the list specified by the labels field.

  - repos:
    - kubernetes/kubernetes
    milestone: v1.12
    includedBranches:
    - master
    - release-1.12
    labels:
    - lgtm
    - approved
    - priority/critical-urgent
    - "cncf-cla: yes"
    missingLabels:
      # as above...
  - repos:
    - kubernetes/kubernetes
    excludedBranches:
    - master
    - release-1.12
    labels:
    - lgtm
    - approved
    - "cncf-cla: yes"
    missingLabels:
      # as above...

Code Thaw

Code thaw removes the release cycle merge restrictions and replaces the two queries with a single one. We remain in this state until the next code slush.

  - repos:
    - kubernetes/kubernetes
    labels:
    - lgtm
    - approved
    - "cncf-cla: yes"
    missingLabels:
      # as above...
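The three phases differ only in a few fields, which the following sketch summarizes by building the query entries as plain Python dicts. This is a hypothetical helper, not something test-infra provides: queries_for_phase and BASE_QUERY are names made up for the example, the missingLabels list is abbreviated, and the result would still have to be written into config.yaml by hand or with a YAML library of your choice.

```python
import copy

# Abbreviated base query; the real missingLabels list is longer (see above).
BASE_QUERY = {
    "repos": ["kubernetes/kubernetes"],
    "labels": ["lgtm", "approved", "cncf-cla: yes"],
    "missingLabels": ["do-not-merge", "do-not-merge/hold", "needs-rebase"],
}


def queries_for_phase(phase, version):
    """Build the kubernetes/kubernetes queries for a phase.

    phase is "thaw", "slush", or "freeze"; version is a string such as "1.12".
    """
    if phase == "thaw":
        # Normal development: a single unrestricted query.
        return [copy.deepcopy(BASE_QUERY)]
    if phase not in ("slush", "freeze"):
        raise ValueError("unknown phase: {}".format(phase))

    branches = ["master", "release-" + version]

    # Query for master and the current release branch.
    restricted = copy.deepcopy(BASE_QUERY)
    restricted["milestone"] = "v" + version
    restricted["includedBranches"] = list(branches)
    if phase == "freeze":
        restricted["labels"].append("priority/critical-urgent")

    # Query for all other branches, unchanged apart from the exclusion.
    others = copy.deepcopy(BASE_QUERY)
    others["excludedBranches"] = list(branches)
    return [restricted, others]
```

For example, queries_for_phase("freeze", "1.12") returns two entries, the first of which carries the v1.12 milestone and the priority/critical-urgent label.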

Ensure the stability of test infra

During the release cycle, especially during code freeze, the test infra lead needs to actively watch:

  1. Whether presubmit/CI is failing due to test infra issues (do some initial triage with the CI Signal Lead)

  2. Whether Tide is merging PRs into the master and release branches

We record test-infra commit SHAs in each Testgrid tab. If CI starts to fail between two test-infra commits, the test infra lead can diff the SHAs to determine whether the failure was caused by a test-infra change.
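One quick way to diff two SHAs is GitHub's compare view. The sketch below builds that URL from the last passing and first failing test-infra SHAs; compare_url is a hypothetical helper and the SHAs in the usage note are placeholders, not real commits.

```python
import re

_SHA = re.compile(r"^[0-9a-f]{7,40}$")


def compare_url(old_sha, new_sha, repo="kubernetes/test-infra"):
    """Return the GitHub compare URL listing commits after old_sha up to new_sha."""
    for sha in (old_sha, new_sha):
        if not _SHA.match(sha):
            raise ValueError("not a commit SHA: {!r}".format(sha))
    return "https://github.com/{}/compare/{}...{}".format(repo, old_sha, new_sha)
```

For example, compare_url("abc1234", "def5678") gives a link that shows every test-infra commit that landed between the two runs.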

The Velodrome monitoring dashboard will be a good friend here.

Monitoring Tide

It is important to monitor Tide after config changes are made for code slush, freeze and thaw to ensure that the changes are having the intended effect.

Until the CNCF infra migration is complete, a member of Google's gke-engprod team will need to monitor Tide logs. However, most of Tide's behavior can be monitored without access to the cluster. The Tide dashboard and Velodrome monitoring dashboard provide insight into what Tide is currently doing, how much load it is handling, and how it is performing.

Test-Infra 'Code Freeze'

The stability of our test infra is critical to getting reliable testing signals throughout the release cycle, but the signal is most important at the end of the release cycle during code slush and freeze. While the kubernetes/test-infra repo does not enforce additional merge restrictions related to the release cycle, we do try to limit the changes that are merged. Specifically, during slush and freeze, changes to test-infra should be limited to important fixes and work that doesn't impact critical infrastructure. Large changes should be delayed if possible. In particular, bumping the kubekins-e2e images should be avoided unless a critical fix is necessary.

Useful Links

Test Infra Home Page

Prow Home Page

Tide