Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add k8s.{pod,node}.network.{io,errors} metrics #1427

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

ChrsMark
Copy link
Member

@ChrsMark ChrsMark commented Sep 24, 2024

Part of #1032

Fixes #1487

Changes

This PR adds the following metrics:

  • k8s.pod.network.io
  • k8s.pod.network.errors
  • k8s.node.network.io
  • k8s.node.network.errors

The above metrics are already used in the Collector in the kubeletstats receiver and the naming is aligned with the System metrics.

These metrics come with the following attributes network.io.direction and system.device which are already part of SemConv registry and are used in the System metrics. The attributes in the Collector will need to be adjusted to follow the recent SemConv rules.

Merge requirement checklist

instrument: counter
unit: "By"
attributes:
- ref: system.device
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the equivalent of interface?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is designed based on the hostmetricsreceiver which will also need to be updated to the latest SemConv once the System metrics are stable.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, I'm not really sure if that's a good and settled decision right now: #308 (comment)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If #1492 makes it, we will use network.interface.name here.

Comment on lines +110 to +102
| [`network.io.direction`](/docs/attributes-registry/network.md) | string | The network IO operation direction. | `transmit` | `Recommended` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| [`system.device`](/docs/attributes-registry/system.md) | string | The device identifier | `(identifier)` | `Recommended` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To save ourselves work later and to start preparing end users, can we start a list somewhere, maybe in the readme, of the breaking changes that will be included in the stable versions of these semantics?

I know this particular change is additive in this repo, but considering that the kubeletstatsreceiver has been emitting this metric with the attributes interface and direction since its inception this change will be very impactful.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. I wonder where this information would fit best. @open-telemetry/specs-semconv-maintainers @open-telemetry/specs-semconv-approvers any suggestions?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, ideally, we should be able to understand breaking changes and document them in Schema URL eventually for this purpose.

We don't have a good mechanism right now to do that, but @trask and @lmolkova have the most experience having stabilzied HTTP and soon, DB.

Copy link
Member Author

@ChrsMark ChrsMark Oct 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, ideally, we should be able to understand breaking changes and document them in Schema URL eventually for this purpose.

The thing here is that even if those are "fresh" metrics/attributes for the SemConv project, they are different compared to what we have so far in the Collector. So technically they are not breaking changes for SemConv. Will the Schema URL approach be able to support this?

Alternatively, one thing we can do for the Collector specifically is to file a meta issue that will collect all this breaking changes.
We could also update the k8s' components that will be affected to have their documentation explicitly mentioning this breaking change meta issue.

Copy link
Contributor

@lmolkova lmolkova Oct 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we usually have a migration plan as a part of stabilization process - like this one

> [v1.24.0 of this document](https://github.com/open-telemetry/semantic-conventions/blob/v1.24.0/docs/database/database-spans.md)
> (or prior):
>
> * SHOULD NOT change the version of the database conventions that they emit by default
> until the database semantic conventions are marked stable.
> Conventions include, but are not limited to, attributes,
> metric and span names, and unit of measure.
> * SHOULD introduce an environment variable `OTEL_SEMCONV_STABILITY_OPT_IN`
> in the existing major version which is a comma-separated list of values.
> If the list of values includes:
> * `database` - emit the new, stable database conventions,
> and stop emitting the old experimental database conventions
> that the instrumentation emitted previously.
> * `database/dup` - emit both the old and the stable database conventions,
> allowing for a seamless transition.
> * The default behavior (in the absence of one of these values) is to continue
> emitting whatever version of the old experimental database conventions
> the instrumentation was emitting previously.
> * Note: `database/dup` has higher precedence than `database` in case both values are present
> * SHOULD maintain (security patching at a minimum) the existing major version
> for at least six months after it starts emitting both sets of conventions.
> * SHOULD drop the environment variable in the next major version.

In a few words

  • the instrumentation should keep doing what it's doing now (following semconv or not)
  • it should have an option to enable new conventions (for collector I'd assume it won't be an env var, but it'd make sense to have a consistent way of opting in across different receivers).
  • instrumentation can drop old conventions or just switch to a new ones by default with the next major version.

We've been also creating migration guides like these https://github.com/open-telemetry/semantic-conventions/blob/main/docs/non-normative/http-migration.md or https://github.com/open-telemetry/semantic-conventions/blob/main/docs/non-normative/db-migration.md which outline the list of changes.

You can probably do something similar and just mention undocumented (in semconv) old attributes there along with their replacements.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank's for the information @lmolkova! If I understand correctly could we create a similar migration plan file for K8s already and start collecting the diffs there?

If I understood @TylerHelmuth initial comment here, we would like to start informing users already about such changes/diffs even if we are far from stability. In this way users could already be prepared about what changes to expect.

type: metric
metric_name: k8s.node.network.io
stability: experimental
brief: "Node network IO"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you please elaborate? Is it a number of bytes transmitted by the node?

type: metric
metric_name: k8s.node.network.errors
stability: experimental
brief: "Node network errors"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you elaborate on this one? Is there some external documentation we can link, what kinds of errors does it cover?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unit: "{error}"
attributes:
- ref: system.device
- ref: network.io.direction
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there any information about the error we can include? We have a very vague error.type attribute intended for this purpose.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we have any information about the errors' types. It seems these are aggregated errors over a specific interface: https://github.com/kubernetes/kubernetes/blob/v1.31.1/staging/src/k8s.io/kubelet/pkg/apis/stats/v1alpha1/types.go#L189-L204

type: metric
metric_name: k8s.pod.network.errors
stability: experimental
brief: "Pod network errors"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similar to the node errors - could you please elaborate on what it means?

unit: "{error}"
attributes:
- ref: system.device
- ref: network.io.direction
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we report any meaningful error.type here?

type: metric
metric_name: k8s.pod.network.io
stability: experimental
brief: "Pod network IO"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
brief: "Pod network IO"
brief: "Number of bytes transmitted by the pod"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not only transmitted but also received. That's why I guess we used IO in the kubeletstats' docs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense! Is there something more elaborate we can say than "Pod network IO" ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made it Network bytes for the Pod. Guess it's more descriptive?

@ChrsMark ChrsMark requested review from a team and tigrannajaryan as code owners October 18, 2024 06:58
@ChrsMark ChrsMark requested review from a team and removed request for a team and tigrannajaryan October 18, 2024 06:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: In Review
Status: In Discussions
Development

Successfully merging this pull request may close these issues.

[k8s] Define semantic conventions for k8s network metrics
5 participants