[Feature] internal k8s error messages are actionable for general users #675

Closed
3 of 13 tasks
slai opened this issue Jan 22, 2021 · 5 comments
Labels
enhancement New feature or request stale untriaged This issues has not yet been looked at by the Maintainers

Comments

@slai
Contributor

slai commented Jan 22, 2021

Motivation: Why do you think this is important?
When k8s hits an error while starting or running a task, the resulting error message in Flyte Console is often cryptic and not actionable.

Some examples -

  • Pod reported success despite being OOMKilled, seen when a process in the task pod other than pyflyte was OOMKilled, so the task still completes. Reporting success despite a bad condition like OOMKilled is confusing: did the task succeed or not? What memory request/limits were set? Which process was OOMKilled?
  • containers with unready status: [execution_id]|context deadline exceeded, generally seen when a container image cannot be pulled from the registry. This combines an unclear k8s message (why is the container unready?) with a Go-specific error that a general user wouldn't understand (context deadline exceeded, i.e. timed out waiting). What should the user do?
  • [3/3] currentAttempt done. Last Error: SYSTEM::object [execution_id] terminated in the background, manually, generally seen when the task pod was on an instance that was spot pre-empted or otherwise removed. What does 'currentAttempt done' mean? What does 'terminated in the background, manually' mean, and what manually terminated it? This error is actually benign in most cases because the retry should succeed, but the message gives no indication of that.

Goal: What should the final outcome look like, ideally?
The error message should strike a balance between -

  • being understandable for the general user (little k8s knowledge)
  • containing enough info for an engineer to troubleshoot
  • being actionable without being so prescriptive that it rules out other causes

For example, the first error message above could be -

Task ended with one or more processes killed due to lack of memory. Current memory request is X, limit is Y. Process killed was Z. Consider increasing the memory request/limit of this task.
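
A rough sketch of how such a message might be assembled, assuming the memory request, limit, and killed process name can already be read from the pod status. The function name and its parameters are hypothetical and are not part of Flyte's API.

```python
from typing import Optional


def format_oom_message(
    memory_request: str,
    memory_limit: str,
    killed_process: Optional[str] = None,
) -> str:
    """Build a user-facing message for a task pod that had a process OOMKilled.

    Hypothetical helper for illustration only; not Flyte's implementation.
    """
    parts = [
        "Task ended with one or more processes killed due to lack of memory.",
        f"Current memory request is {memory_request}, limit is {memory_limit}.",
    ]
    # The killed process may not always be identifiable, so it is optional.
    if killed_process:
        parts.append(f"Process killed was {killed_process}.")
    parts.append("Consider increasing the memory request/limit of this task.")
    return " ".join(parts)


# Example values are made up for the illustration.
print(format_oom_message("500Mi", "1Gi", "python"))
```

Leaving the killed process optional keeps the message usable even when only the OOMKilled status is known.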

Alternatively, maybe an error code that a user can look up elsewhere with more info would be a better way to keep the necessary k8s detail in one place, and the general user explanation in another.
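
To make the error code idea concrete, here is a minimal sketch, assuming a small catalog that matches the raw message with a regex and attaches a lookup code plus a plain-language explanation. The codes, class, and function names are all made up for illustration; none of them exist in Flyte.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ErrorEntry:
    code: str         # hypothetical lookup code, not a real Flyte error code
    pattern: str      # regex matched against the raw k8s / propeller message
    explanation: str  # plain-language text aimed at general users


# Entries are based on the three example messages in this issue.
CATALOG = [
    ErrorEntry(
        "E-OOM",
        r"OOMKilled",
        "A process in the task was killed due to lack of memory. "
        "Consider increasing the memory request/limit.",
    ),
    ErrorEntry(
        "E-UNREADY",
        r"containers with unready status",
        "The container did not become ready in time, "
        "often because the image could not be pulled from the registry.",
    ),
    ErrorEntry(
        "E-TERMINATED",
        r"terminated in the background",
        "The task pod was removed, for example by spot pre-emption. "
        "A retry should usually succeed.",
    ),
]


def lookup(raw_message: str) -> Optional[ErrorEntry]:
    """Return the first catalog entry whose pattern matches, else None."""
    for entry in CATALOG:
        if re.search(entry.pattern, raw_message):
            return entry
    return None


def annotate(raw_message: str) -> str:
    """Prefix the raw message with a code and explanation when one is known."""
    entry = lookup(raw_message)
    if entry is None:
        return raw_message  # unknown errors pass through unchanged
    return f"[{entry.code}] {entry.explanation} (details: {raw_message})"
```

Keeping the raw message in the output preserves the k8s detail for engineers, while the code gives general users something to look up.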

Flyte component

  • Overall
  • Flyte Setup and Installation scripts
  • Flyte Documentation
  • Flyte communication (slack/email etc)
  • FlytePropeller
  • FlyteIDL (Flyte specification language)
  • Flytekit (Python SDK)
  • FlyteAdmin (Control Plane service)
  • FlytePlugins
  • DataCatalog
  • FlyteStdlib (common libraries)
  • FlyteConsole (UI)
  • Other

Additional context
See https://flyte-org.slack.com/archives/CNMKCU6FR/p1611329101012800 for some further discussion. Also seems like #512 and #535 are similar issues.

Is this a blocker for you to adopt Flyte
Nope, already a user :)

@slai slai added bug Something isn't working untriaged This issues has not yet been looked at by the Maintainers labels Jan 22, 2021
@kumare3
Contributor

kumare3 commented Jan 22, 2021

@slai thank you for this. I think the aim of Flyte should be to simplify the messaging so that a complicated technology is more approachable. I really appreciate this feedback, but I think this is a feature not a bug?

@slai slai changed the title from "[BUG] internal k8s error messages are cryptic for general users" to "[Feature] internal k8s error messages are cryptic for general users" Jan 24, 2021
@slai slai changed the title from "[Feature] internal k8s error messages are cryptic for general users" to "[Feature] internal k8s error messages are actionable for general users" Jan 24, 2021
@slai
Contributor Author

slai commented Jan 24, 2021

@kumare3 good point, I've updated the title and description to a feature request.

@slai slai added enhancement New feature or request and removed bug Something isn't working labels Jan 24, 2021
palchicz pushed a commit to palchicz/flyte that referenced this issue Dec 23, 2021
* [wip] for feast demo

Signed-off-by: Ketan Umare <[email protected]>

* clean up a bit

Signed-off-by: Yee Hing Tong <[email protected]>

* add a test and move where constructor is called

Signed-off-by: Yee Hing Tong <[email protected]>

* remove unneeded import

Signed-off-by: Yee Hing Tong <[email protected]>

* add a part of a test

Signed-off-by: Yee Hing Tong <[email protected]>

* Added tests

Signed-off-by: Kevin Su <[email protected]>

* Fixed lint

Signed-off-by: Kevin Su <[email protected]>

* typo

Signed-off-by: Kevin Su <[email protected]>

Co-authored-by: Yee Hing Tong <[email protected]>
Co-authored-by: Kevin Su <[email protected]>
eapolinario pushed a commit to eapolinario/flyte that referenced this issue Dec 20, 2022
Minor wording changes
Changed namedtuple to NamedTuple
Signed-off-by: SmritiSatyanV <[email protected]>
@github-actions

Hello 👋, This issue has been inactive for over 9 months. To help maintain a clean and focused backlog, we'll be marking this issue as stale and will close the issue if we detect no activity in the next 7 days. Thank you for your contribution and understanding! 🙏

@github-actions github-actions bot added the stale label Aug 26, 2023
@github-actions

github-actions bot commented Sep 2, 2023

Hello 👋, This issue has been inactive for over 9 months and hasn't received any updates since it was marked as stale. We'll be closing this issue for now, but if you believe this issue is still relevant, please feel free to reopen it. Thank you for your contribution and understanding! 🙏

@github-actions github-actions bot closed this as not planned Sep 2, 2023
@eapolinario eapolinario reopened this Nov 2, 2023
@hamersaw
Contributor

k8s event reporting is now available in the UI as of flyteorg/flytepropeller#600
