Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OSD-22507: New IAM credential, secret rotate util #601

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

nephomaniac
Copy link

@nephomaniac nephomaniac commented Jul 23, 2024

First pass at OSD-22507. Attempting to consolidate previous bash, osdctl, additional SOP steps to into a single command/context to help users rotate Osd Admin user(s)' IAM creds and related secrets.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jul 23, 2024
@openshift-ci-robot
Copy link

openshift-ci-robot commented Jul 23, 2024

@nephomaniac: This pull request references OSD-22507 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.17.0" version, but no target version was set.

In response to this:

First pass at OSD-22507. Attempting to consolidate previous bash, osdctl, additional SOP steps to into a single command/context to help users rotate Osd Admin user(s)' IAM creds and related secrets.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested review from fahlmant and mjlshen July 23, 2024 20:56
Copy link
Contributor

openshift-ci bot commented Jul 23, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: nephomaniac
Once this PR has been reviewed and has the lgtm label, please assign iamkirkbater for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@nephomaniac nephomaniac marked this pull request as draft July 24, 2024 19:47
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 24, 2024
Copy link
Contributor

@iamkirkbater iamkirkbater left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks very good! Got a few comments, but I don't think any of them are MAJOR changes, like I mentioned at standup some of them are little nits or things like expanding on what something is with a comment because it doesn't make sense at first glance.

rotateAWSCredsCmd.Flags().BoolVar(&ops.updateCcsCredsCli, "rotate-ccs-admin", false, "Rotate osdCcsAdmin user credentials. Interactive. Use caution!")
rotateAWSCredsCmd.Flags().BoolVar(&ops.describeKeysCli, "describe-keys", false, "Print AWS AccessKey info for osdManagedAdmin and osdCcsAdmin relevant cred rotation, and exit")
rotateAWSCredsCmd.Flags().BoolVar(&ops.describeSecretsCli, "describe-secrets", false, "Print AWS CredentialRequests ref'd secrets info relecant to cred rotation, and exit")
rotateAWSCredsCmd.Flags().BoolVar(&ops.saveSecretKeyToFile, "save-keys", false, "Save 'newly created' secret access key contents stdout output during execution")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When would an SRE use this portion of the command, or was this for testing purposes?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed/replaced the 'save-keys' option with a no-save option to negate saving a yaml copy of the secret/artifact to /tmp upon update() failure.

fmt.Println("Invalid input. Expecting (y)es or (N)o or 'all'")
}
}
if userInput == "y" || userInput == "all" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need this check here since a no will return and exit and the default case ends up just restarting the loop for another prompt if there's an invalid input.

If we want to have a bit of explicit-ness here, we could add a continue to the default case in the switch above which would force it to go back to the top of the loop in case someone put something else below that in the block, but since this specific if check is outside of the for loop I think we'll be safe to remove this if statement entirely.

We could further simplify this by changing userInput to be boolean variable named something like rotateAllWithoutPrompts, default it to false and then only in the case "all" we set that to false and check for that on line 515.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Removed, cleaned this up in latest commit.

fmt.Fprintf(w, "\tSecret Namespace:\t'%s'\n", secret.Namespace)
fmt.Fprintf(w, "\tCreate:\t'%v'\n", secret.CreationTimestamp)
w.Flush()
fmt.Printf("Delete Secret ('%s'), ", secret.Name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a semantic reason for this not being part of the writer before it's flushed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, wait - I see this is printing the prompt for the confirmation. Should this be a question then? Or should this only be present if the userInput != "all"?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, updated this section. Separate print/flush is/was purely for the writer's formatting purposes.

Comment on lines 683 to 685
// TODO: Can this differ from the login token needed earlier if the cluster
// lives in another env (ie integration or staging) (?) If so this needs to allow a 2 step process with
// multiple ocm login tokens(?), 1 for each env.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An OCM token should be valid for every environment.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If hive access (prod) is required, this may need more work to allow testing against a cluster running on staging/integration. The utils seem to assume a single ocm config context at the moment (?).

}
respBody := liveResponse.Body().Resources()
if awsAccountClaim, ok := respBody["aws_account_claim"]; ok {
var claimJson map[string]interface{}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worth exploring pulling in the Account Claim structs here instead of using a map: https://github.com/openshift/aws-account-operator/blob/master/api/v1alpha1/accountclaim_types.go

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks pulled in the account Claim struct. This function can/should likely be removed as it's currently only a backup to osdctl's k8s.GetAccountClaimFromClusterID() (added afterward).

}
var inUseStr string = "unknown?"
p := printer.NewTablePrinter(o.IOStreams.Out, 3, 1, 3, ' ')
headers := []string{"#", "ACCESS-KEY-ID", "CR-IN-USE", "CREATED", "STATUS", "IAM-USER"}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is CR-IN-USE supposed to do here? I guess maybe expand on this with a comment?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a comment to explain the 'IN-USE' column/attribute indicates the current key referenced by the account CR.

return err
}
// Allow user to interactively delete keys if user has the max.
// Todo: Should this be presented when a user has less than max too?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd argue no - if there's less than the max we shouldn't need to worry about deleting any.

Then again - I don't think there's even a case to keep two - if we're rotating the creds then either they've been compromised (and therefore we'd want to rotate both out of caution anyway) or we've been asked to as a pre-emptive measure, in which case we should only have one.

I wouldn't change this for now, but let's keep this on our radar of something we may want to come back to later.

// Allow user to interactively delete keys if user has the max.
// Todo: Should this be presented when a user has less than max too?
if len(currKeys.AccessKeyMetadata) >= 2 {
// TODO: Should this script provide an option to delete the oldest key for the user? Too risky?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same answer as above - I think ALL keys should be rotated with this.

checkUser, err := o.awsClient.GetUser(&iam.GetUserInput{UserName: awsSdk.String(osdManagedAdminUsername)})
var nse *iamTypes.NoSuchEntityException
if (err != nil && errors.As(err, &nse)) || checkUser == nil {
o.log.Warn(o.ctx, "User Not Found: '%s', trying user:'%s' instead...\n", osdManagedAdminUsername, common.OSDManagedAdminIAM)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worth changing this to be an Info message - it's an edge case but it's not so uncommon to warrant a Warning which may be misinterpreted as something being wrong when in reality it's just the age of the account.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to INFO level

Comment on lines 1111 to 1120
// Todo: Make displaying the secret a cli provided arg? If this is useful for error/script failure recovery?
if o.saveSecretKeyToFile {
err := os.WriteFile(saveFileManaged, []byte(*createAccessKeyOutput.AccessKey.SecretAccessKey), 0600)
if err != nil {
o.log.Error(o.ctx, "Error! Saving key info to file:'%s', err:%s", saveFileManaged, err)
//return err
} else {
fmt.Printf("\nSaved key to: '%s'\n", saveFileManaged)
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this would be better to add after the attempt to update the secret failed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually - what might be better would be to have the secret attempt to be created, and then if it fails we just write the whole secret object's yaml to a tmp file with a random name (instead of being user-provided) and then the SRE would be able to just oc apply -f /tmp/whatever if there's something wrong, the data would already be base64 encoded and ready to run from there.

Then we could exit with an error, similar to when you oc edit [object] without permissions, it gives you an error and says you can re-run this command with 'oc replace -f /tmp/randomfilepath' to apply

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Current behavior is to build and save the yaml representation of the secret in /tmp and print the path to this file(s). These can be used for apply -f or just review later. This will only happen if the key(s) have been created and something fails during the secret update().

@nephomaniac
Copy link
Author

Big thanks @iamkirkbater for the review and comments. Most recent commit should contain these suggestions as well as support -o yaml/json to stdout (logs to stderr) for the 'describe' specific commands.

@nephomaniac nephomaniac marked this pull request as ready for review August 13, 2024 22:54
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 13, 2024
@openshift-ci openshift-ci bot requested a review from typeid August 13, 2024 22:54
@nephomaniac
Copy link
Author

/test lint

@nephomaniac nephomaniac changed the title OSD-22507: POC first pass at new rotate command (draft) OSD-22507: POC first pass at new rotate command Aug 30, 2024
Copy link
Member

@clcollins clcollins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally super small comments/spelling suggestions. I have one concern with saving the keys that might just be me being too paranoid, and a suggestion about how to search for the credentialsRequests.

Otherwise great work! This is WAY more than I expected it to be when I wrote OSD-22507!

if !o.updateCcsCredsCli && !o.updateMgmtCredsCli && !o.describeKeysCli && !o.describeSecretsCli {
return cmdutil.UsageErrorf(cmd, "must provide one or more actions: ('--rotate-managed-admin' and/or '--rotate-ccs-admin'), or '--describe-secrets', or '--describe-keys'")
}
if (o.updateCcsCredsCli || o.updateMgmtCredsCli) && (o.describeKeysCli || o.describeSecretsCli) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cobra has "Flag Groups" that you can use to mark mutually exclusive options.

Example: examples,
DisableAutoGenTag: true,
Run: func(cmd *cobra.Command, args []string) {
cmdutil.CheckErr(ops.preRunCliChecks(cmd, args))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ops.preRunCliChecks might make more sense in the context of the Cobra PreRun: or PreRunE:.

return nil
}
// Display cluster info, let user confirm they are on the correct cluster, etc...
o.printClusterInfo()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Has the cluster info already been printed by err = o.printClusterInfo() on line 239 already?

return err
}
}
//
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
//

o.log.Warn(o.ctx, "fetchCredentialsRequests returned err:'%s'\n", err)
return err
}
fmt.Printf("\nOptions --describe-keys, --describe-secrets can be used to provide additional info/status\n")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You use a mix of the logger and fmt.PrintX for output through here. Is that intentional, given this is a CLI tool and there's not a logfile/external logging.

(Genuinely asking - I have struggled with some decision making in ocm-container along the same lines).

// lives in another env (ie integration or staging) (?) If so this needs to allow a 2 step process with
// multiple ocm login tokens(?), 1 for each env.
// Logging into Hive Shard here (equiv of 'ocm backplane login $HIVESHARD')
// Is specifially creating the hive connection here a better/worse option than connecting via 'ocm backplane'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Is specifially creating the hive connection here a better/worse option than connecting via 'ocm backplane'
// Is specifically creating the hive connection here a better/worse option than connecting via 'ocm backplane'

// multiple ocm login tokens(?), 1 for each env.
// Logging into Hive Shard here (equiv of 'ocm backplane login $HIVESHARD')
// Is specifially creating the hive connection here a better/worse option than connecting via 'ocm backplane'
// externally, and using the LazyClient (ie o.kubeCli) instead?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO - I think we should do all the hive connections required for this operation via code in these functions, because we can ensure everything (env, logged in clusters, etc) and there's no other interaction with (or need for) an external client.

var err error

if len(o.accountIDSuffixLabel) <= 0 {
return fmt.Errorf("doAwsCredRoteate() empty required accountIDSuffixLabel")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return fmt.Errorf("doAwsCredRoteate() empty required accountIDSuffixLabel")
return fmt.Errorf("doAwsCredRotate() empty required accountIDSuffixLabel")

o.log.Warn(o.ctx, "%v", err)
return "", err
}
saveFile, err := os.CreateTemp(os.TempDir(), string(*accessKey.UserName))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it overkill to add a chmod 600? Probably right? Something about saving the secret locally makes me uncomfortable, but I dunno if that's reasonable or not.

Lemme tractor-beam @wshearn for Team Security's take.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe in the event of an error like this it could be displayed on-screen instead? I dunno, I'm probably being over-paranoid.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the input! I agree. Knowing the flow better, I dont think there's a lot of benefit to this upon failure either. I'll remove all of this. In terms of recovering from errors during the key rotation portion, I've added some retry logic around the syncset portion (per potential syncset collisions noted in the original SOP).

}
o.log.Debug(o.ctx, "Successfully updated secrets for '%s'\n", userName)
} else {
// Thi is a secondary check, and should have also been performed early on.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Thi is a secondary check, and should have also been performed early on.
// This is a secondary check, and should have also been performed early on.

Copy link
Contributor

openshift-ci bot commented Sep 9, 2024

@nephomaniac: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@nephomaniac nephomaniac changed the title OSD-22507: POC first pass at new rotate command OSD-22507: New IAM credential, secret rotate util Sep 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants