chore: adding kafka brokers metrics #196

Closed
wants to merge 10 commits

Conversation

jcountsNR

No description provided.

@jcountsNR
Author

Adding a note: this is a continuation of an earlier PR. I just could not make changes to that fork, so I needed to create a new one.

@jcountsNR
Author

@jsuereth, I'm tagging you since it looks like you were the assignee on this one. We have a lot of eyes on our end wanting to get this one to the finish line, so please let me know if there is anything I can do to make sure this one is good to go. Thanks!

Member

@dmitryax left a comment

Please move the metrics to another section, `### Broker Metrics`.

@jcountsNR
Author

Hi @dmitryax, I'd like to make some progress on this while I'm waiting on approval for the semantic conventions PR. Can you tell me what the CI failure is? I'm not seeing any actual error code, so I don't know what needs to be fixed.

Contributor

@lmolkova left a comment

I'm not sure whether this PR documents what's already done for Kafka and there are limitations on what can be changed, but I suggest staying as close to the general OTel metric requirements as possible.
I left several comments on this.

Also, I'd recommend using messaging.kafka.broker (not brokers) as a namespace.

| messaging.kafka.brokers.count | UpDownCounter | Int64 | brokers | `{broker}` | sum of brokers in the cluster | | |
| messaging.kafka.brokers.consumer.fetch.rate | Gauge | Double | fetches per second | `{fetch}/s` | Average consumer fetch Rate. | `state` | `in`, `out` |
| messaging.kafka.brokers.network.io | Counter | Int64 | bytes | `By` | The bytes received or sent by the broker. | | |
| messaging.kafka.brokers.requests.latency | Gauge | Double | ms | `{ms}` | Average Request latency in ms. | | |
Contributor

Is it possible to report it as a histogram? Then we'd have percentiles, and messaging.kafka.brokers.requests.rate could also be derived from it.


Agreed. I would much prefer to see this as a histogram. Even better, in my opinion, would be an exponential histogram.

Author

Hmm, I don't have any experience converting histograms from metrics.go into histograms for OTel. Are there other examples where someone has done this in the past? The Kafka consumers, for example, are already tracking lag as a gauge metric, so I thought the same would make sense here.

I could see it potentially being an upgrade; I'm just not sure how to apply it to update the collector.


| Name | Instrument | Value type | Unit | Unit ([UCUM](/docs/general/metrics.md#instrument-units)) | Description | Attribute Key | Attribute Values |
| ---------------------------------------------| ------------- | ---------- | ------ | -------------------------------------------- | -------------- | ------------- | ---------------- |
| messaging.kafka.brokers.count | UpDownCounter | Int64 | brokers | `{broker}` | sum of brokers in the cluster | | |
Contributor

Is this metric necessary?

Assuming brokers report a unique instance id (e.g. the standard service.instance.id attribute), it can be derived from other metrics in this list. If brokers also report standard metrics (CPU, memory, etc.), it can be derived from those as well.

Author

That might be true; this is primarily here to replace the kafka.brokers metric which already exists. So doing away with this completely might impact metrics that are being collected now, although I do see the point that it could be derived if we added the attribute.

| Name | Instrument | Value type | Unit | Unit ([UCUM](/docs/general/metrics.md#instrument-units)) | Description | Attribute Key | Attribute Values |
| ---------------------------------------------| ------------- | ---------- | ------ | -------------------------------------------- | -------------- | ------------- | ---------------- |
| messaging.kafka.brokers.count | UpDownCounter | Int64 | brokers | `{broker}` | sum of brokers in the cluster | | |
| messaging.kafka.brokers.consumer.fetch.rate | Gauge | Double | fetches per second | `{fetch}/s` | Average consumer fetch Rate. | `state` | `in`, `out` |
Contributor

If brokers can report messaging.kafka.brokers.consumer.fetch.count, it can provide more information and the rate can be derived from it - can it be changed?


Indeed. I thought that it was preferred not to have rate metrics but instead use a counter from which the rate can be derived for whatever time period is desired.

Author

Hmm, yeah I think that is correct. This should be updated.
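For illustration, here is a minimal sketch of the counter-based approach using the OpenTelemetry Go metric API; the metric name and the broker id attribute are placeholders for this discussion, not agreed conventions. The backend can then derive the fetch rate over any time window.

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

func main() {
	ctx := context.Background()
	meter := otel.Meter("kafka-broker-example")

	// Monotonic counter of consumer fetches; backends derive the rate over any window.
	fetchCount, err := meter.Int64Counter(
		"messaging.kafka.broker.consumer.fetch.count", // placeholder name, not an agreed convention
		metric.WithUnit("{fetch}"),
		metric.WithDescription("Consumer fetches handled by the broker."),
	)
	if err != nil {
		panic(err)
	}

	// Record one fetch, attributed to a placeholder broker id attribute.
	fetchCount.Add(ctx, 1, metric.WithAttributes(attribute.Int("messaging.kafka.broker.id", 1)))
}
```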

| messaging.kafka.brokers.consumer.fetch.rate | Gauge | Double | fetches per second | `{fetch}/s` | Average consumer fetch Rate. | `state` | `in`, `out` |
| messaging.kafka.brokers.network.io | Counter | Int64 | bytes | `By` | The bytes received or sent by the broker. | | |
| messaging.kafka.brokers.requests.latency | Gauge | Double | ms | `{ms}` | Average Request latency in ms. | | |
| messaging.kafka.brokers.requests.rate | Gauge | Double | requests per second | `{request}/s`| Average request rate per second. | | |
Contributor

same here, messaging.kafka.brokers.request.count is more flexible and the rate can be derived from it.

Author

agreed.

| messaging.kafka.brokers.network.io | Counter | Int64 | bytes | `By` | The bytes received or sent by the broker. | | |
| messaging.kafka.brokers.requests.latency | Gauge | Double | ms | `{ms}` | Average Request latency in ms. | | |
| messaging.kafka.brokers.requests.rate | Gauge | Double | requests per second | `{request}/s`| Average request rate per second. | | |
| messaging.kafka.brokers.requsts.size | Gauge | Double | bytes | `By` | Average request size in bytes. | | |
Contributor

This would probably be best represented with a histogram to capture the distribution, and then a rate counter is not needed.
Another possibility would be to only report messaging.kafka.brokers.network.io with a direction attribute, or to use a messaging.kafka.brokers.request.bytes counter counting all the bytes.

Author

Hi @lmolkova, is there an example of a histogram metric available, and how would that be defined?

Contributor

You can find examples throughout this repo; for example, take a look at http.client.request.duration. There is a lot of information in the spec repo, and also take a look at the metrics docs on opentelemetry.io.
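For illustration, a minimal sketch of how such a histogram could be defined with the OpenTelemetry Go metric API; the metric name, unit, and broker id attribute below are placeholders, not conventions agreed in this PR:

```go
package main

import (
	"context"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

func main() {
	ctx := context.Background()
	meter := otel.Meter("kafka-broker-example")

	// Histogram of request durations; percentiles and request rate can both be derived from it.
	reqDuration, err := meter.Float64Histogram(
		"messaging.kafka.broker.request.duration", // placeholder name, not an agreed convention
		metric.WithUnit("s"),
		metric.WithDescription("Duration of requests handled by the broker."),
	)
	if err != nil {
		panic(err)
	}

	start := time.Now()
	// ... handle a request ...
	reqDuration.Record(ctx, time.Since(start).Seconds(),
		metric.WithAttributes(attribute.Int("messaging.kafka.broker.id", 1))) // placeholder attribute
}
```

The exponential histogram suggested earlier would typically be an SDK aggregation choice (e.g. configured through a view) rather than a different instrument.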

| messaging.kafka.brokers.requests.latency | Gauge | Double | ms | `{ms}` | Average Request latency in ms. | | |
| messaging.kafka.brokers.requests.rate | Gauge | Double | requests per second | `{request}/s`| Average request rate per second. | | |
| messaging.kafka.brokers.requsts.size | Gauge | Double | bytes | `By` | Average request size in bytes. | | |
| messaging.kafka.brokers.responses.rate | Gauge | Double | responses per second| `{response}/s`| Average response rate per second. | | |
Contributor

same comment as on request rate

| messaging.kafka.brokers.requests.rate | Gauge | Double | requests per second | `{request}/s`| Average request rate per second. | | |
| messaging.kafka.brokers.requsts.size | Gauge | Double | bytes | `By` | Average request size in bytes. | | |
| messaging.kafka.brokers.responses.rate | Gauge | Double | responses per second| `{response}/s`| Average response rate per second. | | |
| messaging.kafka.brokers.response_size | Gauge | Double | bytes | `By` | Average response size in bytes. | | |
Contributor

same comment as on request size

| messaging.kafka.brokers.requsts.size | Gauge | Double | bytes | `By` | Average request size in bytes. | | |
| messaging.kafka.brokers.responses.rate | Gauge | Double | responses per second| `{response}/s`| Average response rate per second. | | |
| messaging.kafka.brokers.response_size | Gauge | Double | bytes | `By` | Average response size in bytes. | | |
| messaging.kafka.brokers.requests.in.flight | Gauge | Int64 | requests | `{request}` | Requests in flight. | | |
Contributor

maybe messaging.kafka.brokers.active_requests to stay consistent with HTTP semantic conventions.
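As a rough illustration of that suggestion, an in-flight requests metric could be modeled as an UpDownCounter, similar to how the HTTP conventions model active requests; this is a sketch with the OpenTelemetry Go metric API, and the names are placeholders:

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

func main() {
	ctx := context.Background()
	meter := otel.Meter("kafka-broker-example")

	// UpDownCounter tracking requests currently being processed by the broker.
	activeRequests, err := meter.Int64UpDownCounter(
		"messaging.kafka.broker.active_requests", // placeholder name, not an agreed convention
		metric.WithUnit("{request}"),
		metric.WithDescription("Requests currently being processed by the broker."),
	)
	if err != nil {
		panic(err)
	}

	attrs := metric.WithAttributes(attribute.Int("messaging.kafka.broker.id", 1)) // placeholder attribute
	activeRequests.Add(ctx, 1, attrs)  // request accepted
	// ... handle the request ...
	activeRequests.Add(ctx, -1, attrs) // request completed
}
```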


**Description:** Kafka Broker level metrics.

| Name | Instrument | Value type | Unit | Unit ([UCUM](/docs/general/metrics.md#instrument-units)) | Description | Attribute Key | Attribute Values |
Contributor

@lmolkova Jul 20, 2023

I don't see any attributes - are there any? We should have at least standard service.* ones


I would suggest using the broker's id (which I would call "node id" instead of "broker id" following Kafka best practice).


| ---------------------------------------------| ------------- | ---------- | ------ | -------------------------------------------- | -------------- | ------------- | ---------------- |
| messaging.kafka.brokers.count | UpDownCounter | Int64 | brokers | `{broker}` | sum of brokers in the cluster | | |
| messaging.kafka.brokers.consumer.fetch.rate | Gauge | Double | fetches per second | `{fetch}/s` | Average consumer fetch Rate. | `state` | `in`, `out` |
| messaging.kafka.brokers.network.io | Counter | Int64 | bytes | `By` | The bytes received or sent by the broker. | | |


Wouldn't we want to measure bytes sent and bytes received separately?

Author

That was the original thought, but it was changed. I think changing it back makes sense; the PR I have for the kafka collector keeps them separated.
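For illustration, a single bytes counter with a direction attribute (as suggested earlier in the review) could cover both sent and received bytes; this is a minimal sketch with the OpenTelemetry Go metric API, and the metric and attribute names are placeholders:

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

func main() {
	ctx := context.Background()
	meter := otel.Meter("kafka-broker-example")

	// One monotonic byte counter; a direction attribute splits sent vs. received.
	networkIO, err := meter.Int64Counter(
		"messaging.kafka.broker.network.io", // placeholder name, not an agreed convention
		metric.WithUnit("By"),
		metric.WithDescription("Bytes sent or received by the broker."),
	)
	if err != nil {
		panic(err)
	}

	brokerID := attribute.Int("messaging.kafka.broker.id", 1) // placeholder attribute
	networkIO.Add(ctx, 4096, metric.WithAttributes(brokerID, attribute.String("direction", "receive")))
	networkIO.Add(ctx, 1024, metric.WithAttributes(brokerID, attribute.String("direction", "transmit")))
}
```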



@jcountsNR
Author

Hi @AndrewJSchofield @lmolkova @dmitryax. I made some updates and pushed the commit. It looks correct to me now; let me know if I've missed anything.

@jcountsNR
Author

One other follow-up: I don't think that node and broker are interchangeable, and since this is specific to broker metrics, I'm inclined to say we should leave it as broker.id. I did add the attribute to everything except the broker count.

@jcountsNR
Author

Sorry for the delayed response to this. This is for a very specific integration which uses sarama metrics, and it has some limitations. Since that is the case, I'm not sure that making significant updates to these metrics makes sense, because they aren't possible with the one integration so far that will be using them.

The broker-related metrics, for example, don't use counts but rather rates for all of the metrics they translate for Kafka. I think this PR should probably either be cancelled or updated to be a more general naming translation for the metrics we can get from Confluent Cloud brokers.

@pyohannes
Contributor

In a semantic conventions SIG meeting, an agreement was reached to remove Kafka broker metrics from the semantic conventions. See #338, which removes them and gives a list of reasons.

@jcountsNR, if you have any opinions please weigh in on #338. Otherwise let's close this PR when/if #338 is merged.

@jcountsNR
Author

@pyohannes I'm good to close it, I don't think it's valid anymore.
