Add slack notifier for failed step functions #554

Open
alexiswl opened this issue Sep 16, 2024 · 4 comments
Assignees
Labels
feature New feature

Comments

@alexiswl
Member

For production workloads it's important we know if the glue fails.

TODO

Getting an example of a failed-state event from AWS CloudTrail doesn't seem to be straightforward; I can only find StartExecution events.

Simple Slack notifier logic can be found here: https://github.com/umccr/infrastructure/blob/master/cdk/apps/icav2_credentials/lambdas/slack_notifier/notify_slack.py
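
One possible way around the CloudTrail gap, offered as a sketch rather than a confirmed design: Step Functions publishes "Step Functions Execution Status Change" events to EventBridge (for Standard workflows, as far as I know), and a rule can filter those on `detail.status`. The snippet below shows such an event pattern plus a hypothetical `build_slack_message` helper. The sample event values, the helper name and the message layout are assumptions for illustration and are not taken from the linked notifier; the `error` and `cause` fields are read defensively because they may not be present on every event.

```python
import json

# EventBridge event pattern matching unsuccessful Step Functions executions.
FAILED_EXECUTION_PATTERN = {
    "source": ["aws.states"],
    "detail-type": ["Step Functions Execution Status Change"],
    "detail": {"status": ["FAILED", "TIMED_OUT", "ABORTED"]},
}


def build_slack_message(event: dict) -> dict:
    """Turn an execution status change event into a Slack webhook payload.

    Hypothetical helper: the message layout is an assumption.
    """
    detail = event.get("detail", {})
    # The state machine name is the last segment of its ARN.
    state_machine = detail.get("stateMachineArn", "unknown").split(":")[-1]
    status = detail.get("status", "UNKNOWN")
    lines = [
        f"Step function `{state_machine}` finished with status *{status}*",
        f"Execution: `{detail.get('executionArn', 'unknown')}`",
    ]
    # Only add error details when the event carries them.
    if detail.get("error"):
        lines.append(f"Error: {detail['error']}")
    if detail.get("cause"):
        lines.append(f"Cause: {detail['cause']}")
    return {"text": "\n".join(lines)}


# Made-up sample event, trimmed to the fields used above.
SAMPLE_EVENT = {
    "source": "aws.states",
    "detail-type": "Step Functions Execution Status Change",
    "detail": {
        "executionArn": "arn:aws:states:ap-southeast-2:123456789012:execution:my-state-machine:run-1",
        "stateMachineArn": "arn:aws:states:ap-southeast-2:123456789012:stateMachine:my-state-machine",
        "status": "FAILED",
        "error": "States.TaskFailed",
        "cause": "Lambda raised an exception",
    },
}

if __name__ == "__main__":
    print(json.dumps(build_slack_message(SAMPLE_EVENT), indent=2))
```

The returned dictionary has the `{"text": ...}` shape that Slack incoming webhooks accept, so a Lambda target for the rule could POST it as JSON to the webhook URL.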

@alexiswl alexiswl added the feature New feature label Sep 16, 2024
@alexiswl alexiswl self-assigned this Sep 16, 2024
@victorskl
Member

Please run this past @reisingerf. Flo has some ideas...

@reisingerf
Member

Hm... is that not something the step function could / should do itself?
E.g. if the step function completes correctly, it pushes out a WRSC event with SUCCEEDED; if it fails (in any state), it can push out a WRSC event with FAILED (and an appropriate / useful payload).
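
To make the suggestion concrete, a terminal state in the step function could emit an event along these lines. This is only a sketch: the `Source` and `DetailType` strings, the detail field names and the `build_wrsc_entry` helper are hypothetical placeholders, not the project's actual WRSC schema. Only the entry keys (`Source`, `DetailType`, `Detail`, `EventBusName`) follow the EventBridge `PutEvents` request format.

```python
import json

ALLOWED_STATUSES = {"SUCCEEDED", "FAILED"}


def build_wrsc_entry(run_name: str, status: str, payload: dict,
                     bus_name: str = "default") -> dict:
    """Build one PutEvents entry announcing the final status of a run.

    Hypothetical helper; all string values below are placeholders.
    """
    if status not in ALLOWED_STATUSES:
        raise ValueError(f"unsupported status: {status}")
    return {
        "Source": "example.workflow",  # placeholder source name
        "DetailType": "WRSC",          # placeholder detail type
        # PutEvents expects the detail as a JSON string, not a dict.
        "Detail": json.dumps(
            {"workflowRunName": run_name, "status": status, "payload": payload}
        ),
        "EventBusName": bus_name,
    }


if __name__ == "__main__":
    entry = build_wrsc_entry("run-1", "FAILED", {"error": "States.TaskFailed"})
    print(json.dumps(entry, indent=2))
```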

@alexiswl
Member Author

> if it fails (in any state) it can push out a WRSC with FAILED (and an appropriate / useful payload)?

Would it not then need to capture a potential failure in every task? I'm not sure it'd be able to capture everything.
I saw this more as being for step functions failing in ways we don't expect (like the secrets rotator notifications), not for when a workflow run has gone rogue.

@reisingerf
Member

> Would it not need to then capture a potential failure in every task?

Maybe... I am not sure there is a global catch for the step function as a whole. We could wrap the actual SF in a wrapper SF with a single step that has a catch... but that's ugly...
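
For reference, the wrapper idea could look roughly like the Amazon States Language definition built below. It is a sketch under assumptions: the `build_wrapper_definition` helper, the state names and the ARNs passed in are hypothetical, and the notifier Lambda is assumed to exist. It relies on the `startExecution.sync:2` service integration to run the inner state machine and on a `Catch` with `States.ALL` on that single task. Note that `States.ALL` does not cover every error type, and the catch cannot fire if the wrapper execution itself is stopped.

```python
import json


def build_wrapper_definition(inner_state_machine_arn: str,
                             notifier_lambda_arn: str) -> dict:
    """Return an ASL definition that runs the inner SF and catches its failure."""
    return {
        "Comment": "Wrapper that runs the real state machine and catches failures",
        "StartAt": "RunInner",
        "States": {
            "RunInner": {
                "Type": "Task",
                # Run the nested execution and wait for it to finish.
                "Resource": "arn:aws:states:::states:startExecution.sync:2",
                "Parameters": {
                    "StateMachineArn": inner_state_machine_arn,
                    "Input.$": "$",
                },
                # Single catch covering a failure anywhere in the inner SF.
                "Catch": [
                    {
                        "ErrorEquals": ["States.ALL"],
                        "ResultPath": "$.error",
                        "Next": "NotifyFailure",
                    }
                ],
                "End": True,
            },
            "NotifyFailure": {
                "Type": "Task",
                "Resource": "arn:aws:states:::lambda:invoke",
                "Parameters": {
                    "FunctionName": notifier_lambda_arn,
                    "Payload.$": "$",
                },
                "Next": "Failed",
            },
            # Still end the wrapper as failed after notifying.
            "Failed": {"Type": "Fail", "Error": "InnerExecutionFailed"},
        },
    }


if __name__ == "__main__":
    definition = build_wrapper_definition(
        "arn:aws:states:ap-southeast-2:123456789012:stateMachine:inner",
        "arn:aws:lambda:ap-southeast-2:123456789012:function:notify-slack",
    )
    print(json.dumps(definition, indent=2))
```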

> step functions failing in ways we don't expect them to fail

Is it important to know that an SF has failed, or more generally that an analysis has not progressed as it should have?
(The second part should be covered as soon as we pre-generate workflow runs. We'd know what is supposed to run and could check which of those have not progressed.)

3 participants