
OTA-1177: Gather OSUS data #416

Merged · 2 commits · Sep 18, 2024

Conversation

oarribas (Contributor)

Collect data from OSUS operator if installed in the cluster.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Apr 12, 2024

openshift-ci-robot commented Apr 12, 2024

@oarribas: This pull request references OTA-1177 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

In response to this:

Collect data from OSUS operator if installed in the cluster.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

OSUS_OPERATOR_NAME="update-service-operator"
get_log_collection_args

HAS_OSUS=$(oc get csv -A --no-headers -o custom-columns=NS:.metadata.namespace,OPERATOR:.metadata.name --ignore-not-found=true | awk '/'${OSUS_OPERATOR_NAME}'/ {print $1}')
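The snippet above gathers OSUS data only when the operator's CSV is present somewhere in the cluster. A minimal, self-contained sketch of that detection pattern follows; the `find_osus_namespace` helper and the sample CSV listing are illustrative stand-ins (they are not from the PR), since live `oc get csv -A` output is not available outside a cluster. Passing the pattern via `awk -v` is one way to avoid the quoting pitfalls of splicing a shell variable directly into the awk program text:

```shell
#!/usr/bin/env bash
# Illustrative sketch of the conditional-collection pattern used in this PR.
# The operator name matches the real one; the CSV listing below is made-up
# sample data standing in for live `oc get csv -A` output.

OSUS_OPERATOR_NAME="update-service-operator"

# Print the namespace (first column) of any line whose CSV name contains
# the operator name. `awk -v` passes the pattern in as an awk variable.
find_osus_namespace() {
    awk -v op="${OSUS_OPERATOR_NAME}" '$0 ~ op {print $1}'
}

# Simulated `oc get csv -A --no-headers -o custom-columns=NS:...,OPERATOR:...` output:
SAMPLE_CSV_LIST="openshift-update-service   update-service-operator.v5.0.2
openshift-operators         some-other-operator.v1.0.0"

HAS_OSUS=$(printf '%s\n' "${SAMPLE_CSV_LIST}" | find_osus_namespace)

if [ -n "${HAS_OSUS}" ]; then
    echo "OSUS installed in namespace: ${HAS_OSUS}"
    # In the real gather script, this is where the namespace inspection
    # (e.g. an `oc adm inspect` of the detected namespace) would run.
fi
```

If the operator is not installed, `HAS_OSUS` stays empty and nothing extra is collected, which is what keeps the size impact conditional.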
wking (Member) commented Apr 12, 2024

This seems like a hard direction for an OCP-central must-gather to support. I thought the OLM-installed operator pattern was for each operator to ship a separate image with the must-gather-for-them logic (as described in the KCS you'd linked from OTA-1177, and also in these docs), so the central must-gather maintainers didn't have to be bothered reviewing gather logic for all the many, many, OLM-installed operators that could possibly be present?

oarribas (Contributor, author)

@wking, while I agree with your comment for most of the operators that can be installed in OpenShift, I think creating a full image for this operator is excessive.
Other operators provide extra capabilities to the cluster, and usually involve several different CRDs and even several namespaces. This operator supports one of the core capabilities of OpenShift, the upgrade process (in this case, for disconnected clusters), and many support cases already include a must-gather proactively. The info for this operator (when installed) shouldn't significantly increase the size of the must-gather, and collecting it here avoids asking for a separate must-gather.

oarribas (Contributor, author)

@wking , any thoughts on the above?

Contributor

@oarribas / @wking the only thing this collects is the details from a given namespace (or namespaces). I think this is fine (given it's related to updates); but @soltysh or @ingvagabund should have the final say on whether this is provided by this image or an independent image.

Member

@oarribas how big are the extra manifests? Can you share an example of running both commands?

oarribas (Contributor, author)

@ingvagabund, checking data from some cases, it depends a lot on the volume of the logs from the pods.
The largest inspect of the namespace I have seen is 30 MB uncompressed, and 2-3 MB compressed. The updateservice resource itself is only a few KB.

oarribas (Contributor, author)

@ingvagabund , @soltysh , any thoughts based on the above?

ingvagabund (Member) commented Aug 13, 2024

For some time we have been merging extra collections with the promise of collecting extra "small" data that helps avoid asking customers to run yet another must-gather image. This is even more relevant in disconnected environments. Yet, we don't track how much these extra collections increase the overall must-gather size on average. Before going further I'd like to see what happens when all the extra collections are triggered, e.g. what the estimated worst-case bump in the collected data is. A new section under https://github.com/openshift/must-gather/blob/master/README.md will do. E.g.

Extra collections

| script location | short description | condition | estimated size |
| --- | --- | --- | --- |
| /usr/bin/gather_aro | Gather ARO Cluster Data | ns/openshift-azure-operator or ns/openshift-azure-logging present | ?? |
| /usr/bin/gather_vsphere | Gather vSphere resources | vSphere CSI driver is installed | ?? |
| ... | ... | ... | ... |

oarribas (Contributor, author) commented Aug 21, 2024

@ingvagabund, the estimated size for this one (for the OSUS operator), when compressed, is around 3 MB (usually less, as per several inspects of the openshift-update-service namespace I have reviewed). And it's only collected if OSUS is installed.

Regarding the estimated "worst" case above: collections like ARO and vSphere cannot both trigger at the same time, since their conditions are mutually exclusive.

ingvagabund (Member)

/approve
/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 23, 2024
ingvagabund (Member)

@kasturinarra just out of curiosity, do we have any test cases/jobs monitoring how much an average must-gather image grows over time in the context of various installations?

@oarribas are there any statistics about must-gather size? E.g. a matrix of which operators are installed -> how much data can be collected. Or, what are the variable parts that can significantly increase the size?

@sferich888 is it possible to make a matrix of all flavors of an OCP cluster, including layered products, to see how complex a must-gather collection can be? I am quite blind here, and would like to extend my perspective so we can make better decisions when reviewing these kinds of additions.

oarribas (Contributor, author) commented Aug 23, 2024

@ingvagabund , checking in OTA-1177

sferich888 (Contributor)

@ingvagabund by my count (before we add on layered products), the matrix you're looking at has 87k+ combinations in it.

>>> versions = ['4.12', '4.13', '4.14', '4.15', '4.16']
>>> IaaS_providers = ['Alibaba', 'AWS', 'Azure', 'Azure Stack Hub', 'GCP', 'IBM', 'Nutanix', 'BareMetal', 'OpenStack', 'Vsphere', 'OCI']
>>> Install_Method = ['IPI', 'UPI', 'Assisted Installer']
>>> Install_Mode = ['Connected', 'Disconnected']
>>> Deployment_Pattern = ['SingleNode', 'SingleNode+', '3C2W', '3C3I2W', '3CW', '3CW3I']
>>> ### C = Control Plane, W = Worker, I = Infrastructure, += Added workers
>>> Arch = ['x86_64', 'S390', 'ARM', 'Power']
>>> 
>>> from itertools import product
>>> 
>>> m_lists = [versions, IaaS_providers, Install_Method, Install_Mode, Deployment_Pattern, Arch, IaaS_providers]
>>> cp = list(product(*m_lists))
>>> len(cp)
87120

However, when it comes to must-gather and testing, we are building a tool that works for the majority of our user base; I think the more important thing to consider is that only about 9k of those combinations (or 10% of that matrix) matter.

The biggest issues I have seen are related to operating at specific sizes and scales, i.e. with our deployment patterns (combinations). We see the biggest challenges when must-gather can't find a host to run on (single-node clusters, or clusters with schedulable control planes that are loaded with work), or has to crowd out a workload to start (people really don't like this, but it's necessary), or when we try to operate at large scales (500+ nodes, with workloads).

The biggest issues we see are with the time to collect data, and with how much data we collect. (Note: we don't automatically compress archives; there is an RFE for this that hasn't been actioned yet, so we probably shouldn't make collection estimates based on compressed sizes.) The size of our archive is an issue for most customers, because in a lot of situations they have to move the data from one system to another just to upload it to Red Hat; that is 2+ data transfers for many customers (mostly customers in disconnected or restricted network environments). Paired with the time to collect a must-gather (20+ minutes in some situations), a customer could be collecting and transferring data for 30 to 40 minutes (based on some estimates).
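The "only about 9k of those combinations matter" point above can be sketched by filtering the same product matrix. The filter criteria below (x86_64 plus IPI) are assumptions chosen only to illustrate the idea; they are not the actual criteria behind the ~9k figure:

```python
# Illustrative sketch (not from the PR comment): shrinking the full install
# matrix to a hypothetical "majority of the user base" subset.
from itertools import product

versions = ['4.12', '4.13', '4.14', '4.15', '4.16']
iaas_providers = ['Alibaba', 'AWS', 'Azure', 'Azure Stack Hub', 'GCP', 'IBM',
                  'Nutanix', 'BareMetal', 'OpenStack', 'Vsphere', 'OCI']
install_method = ['IPI', 'UPI', 'Assisted Installer']
install_mode = ['Connected', 'Disconnected']
deployment_pattern = ['SingleNode', 'SingleNode+', '3C2W', '3C3I2W', '3CW', '3CW3I']
arch = ['x86_64', 'S390', 'ARM', 'Power']

# Same dimensions (including the double-counted providers list) as the
# original snippet, for parity with the 87120 figure.
full = list(product(versions, iaas_providers, install_method, install_mode,
                    deployment_pattern, arch, iaas_providers))
print(len(full))  # 87120

# Hypothetical "majority" filter: x86_64 clusters installed with IPI.
# c[2] is the install method, c[5] is the architecture.
subset = [c for c in full if c[5] == 'x86_64' and c[2] == 'IPI']
print(len(subset))  # 7260
```

Each dimension that can be pinned down multiplies the matrix down proportionally, which is why restricting a couple of axes already lands in the same order of magnitude as the quoted ~10%.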

sferich888 (Contributor)

/lgtm


openshift-ci bot commented Sep 18, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ingvagabund, oarribas, sferich888

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 18, 2024

openshift-ci bot commented Sep 18, 2024

@oarribas: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot bot merged commit ab95e6a into openshift:master Sep 18, 2024
3 checks passed
ingvagabund (Member)

@sferich888 IaaS_providers is mentioned twice in m_lists. Is that on purpose?

openshift-bot (Contributor)

[ART PR BUILD NOTIFIER]

Distgit: ose-must-gather
This PR has been included in build ose-must-gather-container-v4.18.0-202409190709.p0.gab95e6a.assembly.stream.el9.
All builds following this will include this PR.

oarribas (Contributor, author)

/cherry-pick release-4.17

@openshift-cherrypick-robot

@oarribas: new pull request created: #443

In response to this:

/cherry-pick release-4.17

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.

7 participants