Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: adding kafka brokers metrics #196

Closed
wants to merge 10 commits into from
17 changes: 16 additions & 1 deletion docs/messaging/kafka.md
Original file line number Diff line number Diff line change
Expand Up @@ -111,7 +111,6 @@ This section defines how to apply semantic conventions when collecting Kafka met
| messaging.kafka.controllers.active | UpDownCounter | Int64 | controllers | `{controller}` | The number of active controllers in the broker. | | |
| messaging.kafka.leader.elections | Counter | Int64 | elections | `{election}` | Leader election rate (increasing values indicates broker failures). | | |
| messaging.kafka.leader.unclean-elections | Counter | Int64 | elections | `{election}` | Unclean leader election rate (increasing values indicates broker failures). | | |
| messaging.kafka.brokers | UpDownCounter | Int64 | brokers | `{broker}` | Number of brokers in the cluster. | | |
| messaging.kafka.topic.partitions | UpDownCounter | Int64 | partitions | `{partition}` | Number of partitions in topic. | `topic` | The ID (integer) of a topic |
| messaging.kafka.partition.current_offset | Gauge | Int64 | partition offset | `{partition offset}` | Current offset of partition of topic. | `topic` | The ID (integer) of a topic |
| | | | | | | `partition` | The number (integer) of the partition |
Expand Down Expand Up @@ -159,4 +158,20 @@ This section defines how to apply semantic conventions when collecting Kafka met
| messaging.kafka.consumer.lag_sum | Gauge | Int64 | lag sum | `{lag sum}` | Current approximate sum of consumer group lag across all partitions of topic | `group` | The ID (string) of a consumer group |
| | | | | | | `topic` | The ID (integer) of a topic |

### Broker Metrics

**Description:** Kafka Broker level metrics.

| Name | Instrument | Value type | Unit | Unit ([UCUM](/docs/general/metrics.md#instrument-units)) | Description | Attribute Key | Attribute Values |
Copy link
Contributor

@lmolkova lmolkova Jul 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any attributes - are there any? We should have at least standard service.* ones

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest using the broker's id (which I would call "node id" instead of "broker id" following Kafka best practice).

| ---------------------------------------------| ------------- | ---------- | ------ | -------------------------------------------- | -------------- | ------------- | ---------------- |
| messaging.kafka.brokers.count | UpDownCounter | Int64 | brokers | `{broker}` | sum of brokers in the cluster | | |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this metric necessary?

assuming brokers report unique instance id (e.g. standard service.instance.id attribute), it can be derived from other metrics in this list. If brokers also report standard metrics (CPU, memory, etc), this can also be derived from them.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That might be true, this is primarily here to replace the kafka.brokers metric which already exists. So doing away with this completely might impact metrics that are being collected now, although I do see the point that it could be derived if we added the attribute.

| messaging.kafka.brokers.consumer.fetch.rate | Gauge | Double | fetches per second | `{fetch}/s` | Average consumer fetch Rate. | `state` | `in`, `out` |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If brokers can report messaging.kafka.brokers.consumer.fetch.count, it can provide more information and the rate can be derived from it - can it be changed?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed. I thought that it was preferred not to have rate metrics but instead use a counter from which the rate can be derived for whatever time period is desired.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, yeah I think that is correct. This should be updated.

| messaging.kafka.brokers.network.io | Counter | Int64 | bytes | `By` | The bytes received or sent by the broker. | | |

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't we want to measure bytes sent and bytes received separately?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was the original thought, but it was changed. I think changing it back makes sense, the PR I have for the kafka collector is separated.

| messaging.kafka.brokers.requests.latency | Gauge | Double | ms | `{ms}` | Average Request latency in ms. | | |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it possible to report it as a histogram? Then we'll have percentiles and messaging.kafka.brokers.requests.rate can also be derived from it?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. I would much prefer to see this as a histogram. Even better, in my opinion, would be an exponential histogram.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I don't have any experience converting histograms from metrics.go into histograms for OTeL. Are there other examples where someone has done this in the past? The kafka consumers for example is already tracking lag as a gauge metric, so I thought the same would make sense here.

I could see it potentially being an upgrade, I'm just not sure how to apply it to update the collector.

| messaging.kafka.brokers.requests.rate | Gauge | Double | requests per second | `{request}/s`| Average request rate per second. | | |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here, messaging.kafka.brokers.request.count is more flexible and the rate can be derived from it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed.

| messaging.kafka.brokers.requsts.size | Gauge | Double | bytes | `By` | Average request size in bytes. | | |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this would probably be best represented with histogram to allow distribution and then a rate counter is not needed.
Another possibility would be to only report messaging.kafka.brokers.network.io with direction attribute or use messaging.kafka.brokers.request.bytes counter counting all the bytes .

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @lmolkova , is there an example of a histogram metric available and how that would be defined?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can find examples throughout this repo, for example, take a look at http.client.request.duration. there is a lot of information in the spec repo and also take a look at metric docs on opentelemetry.io

| messaging.kafka.brokers.responses.rate | Gauge | Double | responses per second| `{response}/s`| Average response rate per second. | | |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment as on request rate

| messaging.kafka.brokers.response_size | Gauge | Double | bytes | `By` | Average response size in bytes. | | |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment as on request size

| messaging.kafka.brokers.requests.in.flight | Gauge | Int64 | requests | `{request}` | Requests in flight. | | |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe messaging.kafka.brokers.active_requests to stay consistent with HTTP semantic conventions.


[DocumentStatus]: https://github.com/open-telemetry/opentelemetry-specification/tree/v1.22.0/specification/document-status.md