OTA-1177: Gather OSUS data #416

Merged (2 commits) on Sep 18, 2024
4 changes: 4 additions & 0 deletions collection-scripts/gather
@@ -140,6 +140,10 @@ pids+=($!)
/usr/bin/gather_ppc &
pids+=($!)

# Gather OSUS information
/usr/bin/gather_osus &
pids+=($!)

# Check if PID array has any values, if so, wait for them to finish
if [ ${#pids[@]} -ne 0 ]; then
echo "Waiting on subprocesses to finish execution."
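For context, a minimal sketch (not part of this PR's diff) of the background-and-wait pattern the gather script uses; the standalone invocation below is illustrative only:

```bash
#!/bin/bash
# Sketch of the gather script's collection pattern: each gatherer runs in
# the background, its PID is recorded, and the script waits for all of
# them before exiting.
pids=()

/usr/bin/gather_osus &   # the gatherer added by this PR
pids+=($!)               # $! is the PID of the last backgrounded job

# Check if the PID array has any values; if so, wait for them to finish.
if [ ${#pids[@]} -ne 0 ]; then
    echo "Waiting on subprocesses to finish execution."
    wait "${pids[@]}"
fi
```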
26 changes: 26 additions & 0 deletions collection-scripts/gather_osus
@@ -0,0 +1,26 @@
#!/bin/bash

source $(dirname "$0")/common.sh
BASE_COLLECTION_PATH="must-gather"
OSUS_OPERATOR_NAME="update-service-operator"
get_log_collection_args

HAS_OSUS=$(oc get csv -A --no-headers -o custom-columns=NS:.metadata.namespace,OPERATOR:.metadata.name --ignore-not-found=true | awk '/'${OSUS_OPERATOR_NAME}'/ {print $1}')
@wking (Member), Apr 12, 2024:
This seems like a hard direction for an OCP-central must-gather to support. I thought the OLM-installed operator pattern was for each operator to ship a separate image with the must-gather logic for that operator (as described in the KCS you'd linked from OTA-1177, and also in these docs), so the central must-gather maintainers didn't have to be bothered reviewing gather logic for all the many, many OLM-installed operators that could possibly be present?

@oarribas (Contributor, author):

@wking, while I agree with your comment for most of the operators that can be installed in OpenShift, I think creating a full image for this operator is excessive.
Other operators add extra capabilities to the cluster and usually involve several different CRDs and even several namespaces. This operator supports one of the core capabilities of OpenShift, the upgrade (in this case, for disconnected clusters), and many support cases already include a must-gather proactively. The information for this operator (when installed) shouldn't increase the size of the must-gather much, and collecting it here avoids asking for a separate must-gather.

@oarribas (Contributor, author):

@wking, any thoughts on the above?

Contributor:

@oarribas / @wking, the only thing this collects is the details from the given namespace(s). I think this is fine (given it's related to updates), but @soltysh or @ingvagabund should have the final say on whether this is provided by this image or an independent image.

@ingvagabund (Member):
@oarribas how big are the extra manifests? Can you share an example of running both commands?

@oarribas (Contributor, author):

@ingvagabund, checking data from some cases, it depends a lot on the volume of the pod logs.
The largest inspect of the namespace I have seen is 30 MB uncompressed, and between 2-3 MB compressed. The updateservice resource itself is only a few KB.
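For reference, a sketch of running the two inspect commands from this PR by hand; the openshift-update-service namespace is an assumption here, and the size check is only illustrative:

```bash
# Assumes OSUS is installed in the openshift-update-service namespace.
oc adm inspect --dest-dir must-gather "ns/openshift-update-service"
oc adm inspect --dest-dir must-gather updateservices --namespace "openshift-update-service"

# Rough size check: uncompressed directory vs. compressed archive.
du -sh must-gather
tar -czf must-gather.tar.gz must-gather
du -sh must-gather.tar.gz
```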

@oarribas (Contributor, author):

@ingvagabund, @soltysh, any thoughts based on the above?

@ingvagabund (Member), Aug 13, 2024:
For some time we have been merging extra collections with the promise that collecting extra "small" data helps avoid asking customers to run yet another must-gather image, which is especially relevant in disconnected environments. Yet we don't track how much these extra collections increase the overall must-gather size on average. Before going further, I'd like to see what the estimated worst-case bump in the collected data is when all the extra collections are triggered. A new section under https://github.com/openshift/must-gather/blob/master/README.md would do, e.g.:

Extra collections

| script location | short description | condition | estimated size |
| --- | --- | --- | --- |
| /usr/bin/gather_aro | Gather ARO Cluster Data | ns/openshift-azure-operator or ns/openshift-azure-logging present | ?? |
| /usr/bin/gather_vsphere | Gather vSphere resources | vSphere CSI driver is installed | ?? |
| ... | ... | ... | ... |

@oarribas (Contributor, author), Aug 21, 2024:
@ingvagabund, the estimated size for this one (the OSUS operator), when compressed, is around 3 MB (usually less, based on several inspects of the openshift-update-service namespace I have reviewed), and it is only collected if OSUS is installed.

Regarding the estimated "worst" case above, collections like ARO and vSphere cannot be gathered at the same time as long as their conditions work as intended (a cluster cannot be both an ARO and a vSphere cluster).
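Based on the numbers in this thread, a hypothetical row for this script in the proposed README table could look like:

| script location | short description | condition | estimated size |
| --- | --- | --- | --- |
| /usr/bin/gather_osus | Gather OSUS information | update-service-operator CSV present in any namespace | ~3 MB compressed |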

if [ -z "$HAS_OSUS" ]; then
exit 0
fi

function gather_osus() {
echo "INFO: Collecting OSUS Data"
declare -a OSUS_NAMESPACES=(${HAS_OSUS})

for OSUS_NAMESPACE in "${OSUS_NAMESPACES[@]}"; do
oc adm inspect ${log_collection_args} --dest-dir "${BASE_COLLECTION_PATH}" "ns/${OSUS_NAMESPACE}"
oc adm inspect --dest-dir "${BASE_COLLECTION_PATH}" updateservices --namespace "${OSUS_NAMESPACE}"
done
}

gather_osus

# force disk flush to ensure that all data gathered are written
sync
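As a usage note (my assumption, not stated in this PR): once the script ships in the must-gather image, it should also be runnable on its own, since oc adm must-gather can invoke a specific gather command:

```bash
# Run only the OSUS gatherer from the must-gather image; the script path
# matches the file added in this PR.
oc adm must-gather -- /usr/bin/gather_osus
```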