Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kola: support denylist Warn feature #3539

Merged
merged 2 commits into from
Jul 29, 2023

Conversation

nikita-dubrovskii
Copy link
Contributor

@nikita-dubrovskii nikita-dubrovskii commented Jul 18, 2023

#3344

We bubble denylisted tests with 'Warn: true' option as warnings rather than hard failures:
```
kola -p qemu run --parallel 8 ext.config.ntp.* --output-dir tmp/kola

⏭️  Skipping kola test pattern "ext.config.ntp.chrony.dhcp-propagation":
  👉 https://github.com/coreos/fedora-coreos-tracker/issues/1508
⚠️  Warning kola test pattern "ext.config.ntp.timesyncd.dhcp-propagation", snoozing expired on Jul 20 2023:
  👉 https://github.com/coreos/fedora-coreos-tracker/issues/1508
=== RUN   ext.config.ntp.chrony.coreos-platform-chrony-config
=== RUN   ext.config.ntp.timesyncd.dhcp-propagation
--- PASS: ext.config.ntp.chrony.coreos-platform-chrony-config (27.56s)
--- WARN: ext.config.ntp.timesyncd.dhcp-propagation (90.72s)
        cluster.go:162: Error: Unit kola-runext.service exited with code 1
        cluster.go:162: 2023-07-27T06:24:18Z cli: Unit kola-runext.service exited with code 1
        harness.go:1236: kolet failed: : kolet run-test-unit failed: Process exited with status 1
FAIL, output in tmp/kola
+ rc=0
+ set +x
```

Comment on lines 868 to 873
// Ignore the error when only denied tests with Warn:true feaute failed
if runErr != nil && allTestsWarnOnError(testResults.getResults()) {
return nil
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the goal here is that in our pipeline we are able to know when a test that fails is warn: true and "warn" in the pipline (warn in jenkins is different; the build continues but does get flagged with yellow warn status). This is crucial because otherwise we don't get notified when a warning happens.

I'm thinking we should surface this up and return a non-zero return code, but defined return code so that callers (like jenkins) can adjust their behavior. For example if we returned 77 here jenkins could check the return code if it's non-zero and if it is 77 then warn instead of fail. Here is an example where we added a new option to change the return code in #1152

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I played around with this idea and posted some code over in https://github.com/dustymabe/coreos-assembler/commits/PR3539 - let me know what you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd add 77 with a separate PR, but not here (to keep warn:true feature support simple).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd add 77 with a separate PR, but not here (to keep warn:true feature support simple).

Yes, if you don't mind we can open a followup PR for something like changing the exit code and then update coreos-ci-lib to change behavior of kola() based on that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure!

mantle/kola/harness.go Outdated Show resolved Hide resolved
mantle/kola/harness.go Outdated Show resolved Hide resolved
Comment on lines 907 to 909
if IsWarningOnFailure(test) {
continue
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we want this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so. For example denylisted tagged warn: true tests shouldn't affect allow-rerun-success tag=foo failed tests.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahh I see now. if a warn:true test didn't have tag=foo then the loop would exit early because it found a test that didn't allow rerun success, but we don't want this to count against that.

Maybe the better thing to do here would be to not rerun warn:true tests. i.e. we'd filter it out in getRerunnable( somehow. I'm not sure if that's a great idea or not, but spitballing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i've removed this check and now kola simply reruns everything. Anyhow we sort denylist + warn out later

mantle/kola/harness.go Outdated Show resolved Hide resolved
mantle/kola/harness.go Outdated Show resolved Hide resolved
mantle/kola/harness.go Show resolved Hide resolved
@nikita-dubrovskii nikita-dubrovskii force-pushed the issue-3344 branch 2 times, most recently from 5ccf356 to fdedefa Compare July 27, 2023 06:43
dustymabe
dustymabe previously approved these changes Jul 28, 2023
Copy link
Member

@dustymabe dustymabe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mostly LGTM - some comments

mantle/kola/harness.go Outdated Show resolved Hide resolved
Comment on lines 907 to 909
if IsWarningOnFailure(test) {
continue
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahh I see now. if a warn:true test didn't have tag=foo then the loop would exit early because it found a test that didn't allow rerun success, but we don't want this to count against that.

Maybe the better thing to do here would be to not rerun warn:true tests. i.e. we'd filter it out in getRerunnable( somehow. I'm not sure if that's a great idea or not, but spitballing.

mantle/kola/harness.go Outdated Show resolved Hide resolved
Comment on lines 18 to 21
Fail TestResult = "FAIL"
Warn TestResult = "WARN"
Skip TestResult = "SKIP"
Pass TestResult = "PASS"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

random comment: In the future I think it would be cool if we could use colors for these outputs. I'm thinking red, yellow, white, green.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's about s/white/blue/ ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did this work for you? I can't get colors to output locally.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For me that works. tmux and gnome-terminal have no problems

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok I got it working :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what was wrong?

Copy link
Member

@dustymabe dustymabe Jul 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok now the CI unittests are failing because they don't expect the color in the output. We could add the expected color to the CI tests.. Alternatively I wonder if we should try to check to see if the output is to a terminal (tty) or not and add the color based on that. Something similar to https://github.com/coreos/rpm-ostree/blob/dd370d13d3d41da0e3d61fc9ace558d6bc349f7f/src/libpriv/rpmostree-editor.cxx#L37-L55

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another example is NetworkManager. nmcli c show will show colors, but nmcli c show | cat won't because the output is to a pipe.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what was wrong?

I didn't have the code checked out right locally :(

dustymabe
dustymabe previously approved these changes Jul 28, 2023
Copy link
Member

@dustymabe dustymabe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

nikita-dubrovskii and others added 2 commits July 28, 2023 16:33
coreos#3344

We bubble denylisted tests with 'Warn: true' option as warnings rather than hard failures:
```
kola -p qemu run --parallel 8 ext.config.ntp.* --output-dir tmp/kola

⏭️  Skipping kola test pattern "ext.config.ntp.chrony.dhcp-propagation":
  👉 coreos/fedora-coreos-tracker#1508
⚠️  Warning kola test pattern "ext.config.ntp.timesyncd.dhcp-propagation", snoozing expired on Jul 20 2023:
  👉 coreos/fedora-coreos-tracker#1508
=== RUN   ext.config.ntp.chrony.coreos-platform-chrony-config
=== RUN   ext.config.ntp.timesyncd.dhcp-propagation
--- PASS: ext.config.ntp.chrony.coreos-platform-chrony-config (27.56s)
--- WARN: ext.config.ntp.timesyncd.dhcp-propagation (90.72s)
        cluster.go:162: Error: Unit kola-runext.service exited with code 1
        cluster.go:162: 2023-07-27T06:24:18Z cli: Unit kola-runext.service exited with code 1
        harness.go:1236: kolet failed: : kolet run-test-unit failed: Process exited with status 1
FAIL, output in tmp/kola
+ rc=0
+ set +x
```

Co-authored-by: Dusty Mabe <[email protected]>
With the introduction of warn: true the logic gets complicated.
This is an attempt to simplify the logic and handle all corner cases.
@dustymabe
Copy link
Member

/retest

@openshift-ci
Copy link

openshift-ci bot commented Jul 28, 2023

@nikita-dubrovskii: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/rhcos 5c2f264 link true /test rhcos

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@dustymabe
Copy link
Member

the prow/rhcos failure seems unrelated to this. Merging.

@dustymabe dustymabe merged commit f38e1df into coreos:main Jul 29, 2023
2 checks passed
@nikita-dubrovskii nikita-dubrovskii deleted the issue-3344 branch July 29, 2023 09:46
dustymabe added a commit to dustymabe/fedora-coreos-releng-automation that referenced this pull request Aug 22, 2023
warn: true was added in coreos/coreos-assembler#3539
and can be used to make test failures non-fatal in the pipeline. Let's
strip those from the denylist too on promotion.
dustymabe added a commit to dustymabe/fedora-coreos-config that referenced this pull request Aug 22, 2023
This is a new feature that was added recently in
coreos/coreos-assembler#3539.

Now when a snooze expires we can configure it to warn
us and continue to build rather than marking the entire
run as a failure.
dustymabe added a commit to coreos/fedora-coreos-releng-automation that referenced this pull request Aug 23, 2023
warn: true was added in coreos/coreos-assembler#3539
and can be used to make test failures non-fatal in the pipeline. Let's
strip those from the denylist too on promotion.
dustymabe added a commit to coreos/fedora-coreos-config that referenced this pull request Aug 23, 2023
This is a new feature that was added recently in
coreos/coreos-assembler#3539.

Now when a snooze expires we can configure it to warn
us and continue to build rather than marking the entire
run as a failure.
HuijingHei pushed a commit to HuijingHei/fedora-coreos-config that referenced this pull request Oct 10, 2023
This is a new feature that was added recently in
coreos/coreos-assembler#3539.

Now when a snooze expires we can configure it to warn
us and continue to build rather than marking the entire
run as a failure.
HuijingHei pushed a commit to HuijingHei/fedora-coreos-config that referenced this pull request Oct 10, 2023
This is a new feature that was added recently in
coreos/coreos-assembler#3539.

Now when a snooze expires we can configure it to warn
us and continue to build rather than marking the entire
run as a failure.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants