
OTA-1177: Gather OSUS data #416

Merged · 2 commits · Sep 18, 2024

Conversation

oarribas (Contributor)

Collect data from OSUS operator if installed in the cluster.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Apr 12, 2024

openshift-ci-robot commented Apr 12, 2024

@oarribas: This pull request references OTA-1177 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

In response to this:

Collect data from OSUS operator if installed in the cluster.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

OSUS_OPERATOR_NAME="update-service-operator"
get_log_collection_args

HAS_OSUS=$(oc get csv -A --no-headers -o custom-columns=NS:.metadata.namespace,OPERATOR:.metadata.name --ignore-not-found=true | awk '/'${OSUS_OPERATOR_NAME}'/ {print $1}')
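The snippet above gathers OSUS data only when the operator's CSV is present somewhere in the cluster. A minimal, self-contained sketch of that detection pattern follows; the `find_osus_namespace` helper and the sample CSV listing are illustrative stand-ins (they are not from the PR), since live `oc get csv -A` output is not available outside a cluster. Passing the pattern via `awk -v` is one way to avoid the quoting pitfalls of splicing a shell variable directly into the awk program text:

```shell
#!/usr/bin/env bash
# Illustrative sketch of the conditional-collection pattern used in this PR.
# The operator name matches the real one; the CSV listing below is made-up
# sample data standing in for live `oc get csv -A` output.

OSUS_OPERATOR_NAME="update-service-operator"

# Print the namespace (first column) of any line whose CSV name contains
# the operator name. `awk -v` passes the pattern in as an awk variable.
find_osus_namespace() {
    awk -v op="${OSUS_OPERATOR_NAME}" '$0 ~ op {print $1}'
}

# Simulated `oc get csv -A --no-headers -o custom-columns=NS:...,OPERATOR:...` output:
SAMPLE_CSV_LIST="openshift-update-service   update-service-operator.v5.0.2
openshift-operators         some-other-operator.v1.0.0"

HAS_OSUS=$(printf '%s\n' "${SAMPLE_CSV_LIST}" | find_osus_namespace)

if [ -n "${HAS_OSUS}" ]; then
    echo "OSUS installed in namespace: ${HAS_OSUS}"
    # In the real gather script, this is where the namespace inspection
    # (e.g. an `oc adm inspect` of the detected namespace) would run.
fi
```

If the operator is not installed, `HAS_OSUS` stays empty and nothing extra is collected, which is what keeps the size impact conditional.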
wking (Member) commented Apr 12, 2024

This seems like a hard direction for an OCP-central must-gather to support. I thought the OLM-installed operator pattern was for each operator to ship a separate image with the must-gather-for-them logic (as described in the KCS you'd linked from OTA-1177, and also in these docs), so the central must-gather maintainers didn't have to be bothered reviewing gather logic for all the many, many, OLM-installed operators that could possibly be present?

oarribas (Contributor, author)

@wking, while I agree with your comment for most of the operators that can be installed in OpenShift, I think creating a full image for this operator is excessive.
Other operators provide extra capabilities to the cluster, and usually involve several different CRDs and even several namespaces. This operator supports one of the core capabilities of OpenShift, the upgrade process (in this case, for disconnected clusters), and many support cases already include a must-gather proactively. The info for this operator (when installed) shouldn't significantly increase the size of the must-gather, and collecting it here avoids asking for a separate must-gather.

oarribas (Contributor, author)

@wking , any thoughts on the above?

Contributor

@oarribas / @wking the only thing this collects is the details from a given namespace (or namespaces). I think this is fine (given it's related to updates); but @soltysh or @ingvagabund should have the final say on whether this is provided by this image or an independent image.

Member

@oarribas how big are the extra manifests? Can you share an example of running both commands?

oarribas (Contributor, author)

@ingvagabund, checking data from some cases, it depends a lot on the volume of the logs from the pods.
The largest inspect of the namespace I have seen is 30 MB uncompressed, and 2-3 MB compressed. The updateservice resource itself is only a few KB.

oarribas (Contributor, author)

@ingvagabund , @soltysh , any thoughts based on the above?

ingvagabund (Member) commented Aug 13, 2024

For some time we have been merging extra collections with the promise of collecting extra "small" data that helps avoid asking customers to run yet another must-gather image. This is even more relevant in disconnected environments. Yet, we don't track how much these extra collections increase the overall must-gather size on average. Before going further I'd like to see what happens when all the extra collections are triggered, e.g. what the estimated worst-case bump in the collected data is. A new section under https://github.com/openshift/must-gather/blob/master/README.md will do. E.g.

Extra collections

| script location | short description | condition | estimated size |
| --- | --- | --- | --- |
| /usr/bin/gather_aro | Gather ARO Cluster Data | ns/openshift-azure-operator or ns/openshift-azure-logging present | ?? |
| /usr/bin/gather_vsphere | Gather vSphere resources | vSphere CSI driver is installed | ?? |
| ... | ... | ... | ... |

oarribas (Contributor, author) commented Aug 21, 2024

@ingvagabund, the estimated size for this one (for the OSUS operator), when compressed, is around 3 MB (usually less, as per several inspects of the openshift-update-service namespace I have reviewed). And it's only collected if OSUS is installed.

Regarding the estimated "worst" case above: collections like ARO and vSphere cannot both trigger at the same time, since their conditions are mutually exclusive.

ingvagabund (Member)

/approve
/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 23, 2024
ingvagabund (Member)

@kasturinarra just out of curiosity, do we have any test cases/jobs monitoring how much an average must-gather image grows over time in the context of various installations?

@oarribas are there any statistics about must-gather size? E.g. a matrix of which operators are installed -> how much data can be collected. Or, what are the variable parts that can significantly increase the size?

@sferich888 is it possible to make a matrix of all flavors of an OCP cluster, including layered products, to see how complex a must-gather collection can be? I am quite blind here, and would like to extend my perspective so we can make better decisions when reviewing these kinds of additions.

oarribas (Contributor, author) commented Aug 23, 2024

@ingvagabund , checking in OTA-1177

sferich888 (Contributor)

@ingvagabund by my count (before we add on layered products), the matrix you're looking at has 87k+ combinations in it.

>>> versions = ['4.12', '4.13', '4.14', '4.15', '4.16']
>>> IaaS_providers = ['Alibaba', 'AWS', 'Azure', 'Azure Stack Hub', 'GCP', 'IBM', 'Nutanix', 'BareMetal', 'OpenStack', 'Vsphere', 'OCI']
>>> Install_Method = ['IPI', 'UPI', 'Assisted Installer']
>>> Install_Mode = ['Connected', 'Disconnected']
>>> Deployment_Pattern = ['SingleNode', 'SingleNode+', '3C2W', '3C3I2W', '3CW', '3CW3I']
>>> ### C = Control Plane, W = Worker, I = Infrastructure, += Added workers
>>> Arch = ['x86_64', 'S390', 'ARM', 'Power']
>>> 
>>> from itertools import product
>>> 
>>> m_lists = [versions, IaaS_providers, Install_Method, Install_Mode, Deployment_Pattern, Arch, IaaS_providers]
>>> cp = list(product(*m_lists))
>>> len(cp)
87120

However, when it comes to must-gather and testing, we are building a tool that works for the majority of our user base; I think the more important thing to consider is that only about 9k of those combinations (or 10% of that matrix) matter.

The biggest issues I have seen are related to operating at specific sizes and scales, i.e. with our deployment patterns (combinations). We see the biggest challenges when must-gather can't find a host to run on (single-node clusters, or clusters with schedulable control planes that are loaded with work), or has to crowd out a workload to start (people really don't like this, but it's necessary), or when we try to operate at large scales (500+ nodes, with workloads).

The biggest issues we see are with the time to collect data, and with how much data we collect. (Note: we don't automatically compress archives; there is an RFE for this that hasn't been actioned yet, so we probably shouldn't make collection estimates based on compressed sizes.) The size of our archive is an issue for most customers, because in a lot of situations they have to move the data from one system to another just to upload it to Red Hat; that is 2+ data transfers for many customers (mostly customers in disconnected or restricted network environments). Paired with the time to collect a must-gather (20+ minutes in some situations), a customer could be collecting and transferring data for 30 to 40 minutes (based on some estimates).
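The "only about 9k of those combinations matter" point above can be sketched by filtering the same product matrix. The filter criteria below (x86_64 plus IPI) are assumptions chosen only to illustrate the idea; they are not the actual criteria behind the ~9k figure:

```python
# Illustrative sketch (not from the PR comment): shrinking the full install
# matrix to a hypothetical "majority of the user base" subset.
from itertools import product

versions = ['4.12', '4.13', '4.14', '4.15', '4.16']
iaas_providers = ['Alibaba', 'AWS', 'Azure', 'Azure Stack Hub', 'GCP', 'IBM',
                  'Nutanix', 'BareMetal', 'OpenStack', 'Vsphere', 'OCI']
install_method = ['IPI', 'UPI', 'Assisted Installer']
install_mode = ['Connected', 'Disconnected']
deployment_pattern = ['SingleNode', 'SingleNode+', '3C2W', '3C3I2W', '3CW', '3CW3I']
arch = ['x86_64', 'S390', 'ARM', 'Power']

# Same dimensions (including the double-counted providers list) as the
# original snippet, for parity with the 87120 figure.
full = list(product(versions, iaas_providers, install_method, install_mode,
                    deployment_pattern, arch, iaas_providers))
print(len(full))  # 87120

# Hypothetical "majority" filter: x86_64 clusters installed with IPI.
# c[2] is the install method, c[5] is the architecture.
subset = [c for c in full if c[5] == 'x86_64' and c[2] == 'IPI']
print(len(subset))  # 7260
```

Each dimension that can be pinned down multiplies the matrix down proportionally, which is why restricting a couple of axes already lands in the same order of magnitude as the quoted ~10%.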

sferich888 (Contributor)

/lgtm


openshift-ci bot commented Sep 18, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ingvagabund, oarribas, sferich888

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 18, 2024

openshift-ci bot commented Sep 18, 2024

@oarribas: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot bot merged commit ab95e6a into openshift:master Sep 18, 2024
3 checks passed
ingvagabund (Member)

@sferich888 IaaS_providers is mentioned twice in m_lists. Is that on purpose?

openshift-bot (Contributor)

[ART PR BUILD NOTIFIER]

Distgit: ose-must-gather
This PR has been included in build ose-must-gather-container-v4.18.0-202409190709.p0.gab95e6a.assembly.stream.el9.
All builds following this will include this PR.

oarribas (Contributor, author)

/cherry-pick release-4.17

@openshift-cherrypick-robot

@oarribas: new pull request created: #443

In response to this:

/cherry-pick release-4.17

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.

7 participants