Debugging - InMemoryMetricReader: MetricsTimeoutError('Timed out while flushing metric readers') #4196

Open
patricklubach opened this issue Sep 18, 2024 · 0 comments
Labels
bug Something isn't working

Describe your environment

OS: Fedora 39
Python version: 3.12.5
SDK version: 1.27.0
API version: 1.27.0

Code is tested locally and runs in Google Cloud Run Functions (v2).

What happened?

I am working on a use case that exports Google Cloud Apigee metrics via the Apigee metrics API (https://cloud.google.com/apigee/docs/api-platform/analytics/use-analytics-api-measure-api-program-performance) and sends them to endpoints such as Splunk, Dynatrace, etc.
In short, I request the metrics from the API, transform the data into OTLP format (adding it to the reader), and send it to an OpenTelemetry Collector instance that forwards it to the desired endpoint. For this I created a Google Cloud Function running a Python script. The script also creates a report and stores it in another service, so I need to read the metrics data from memory before flushing/exporting it.

Steps to Reproduce

I use the following script to achieve the described goal:

import json
import logging
import math

from cloudevents.http import CloudEvent
import flask
import functions_framework
from opentelemetry import metrics
from opentelemetry.exporter.otlp.proto.http.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (
    InMemoryMetricReader,
    PeriodicExportingMetricReader,
)
from opentelemetry.sdk.resources import Resource


def generate_report(data):
  """Generates report in Firebase

  Args:
      data: Otel metrics data
  """
  # ... lots of code here ...
  pass


def get_metrics_from_apigee():
  """Retrieves data from Apigee Metrics API."""
  # Gets data from Apigee metrics api
  # ... lots of code here ...
  pass


def transform_to_otlp(data) -> None:
  """Transforms the given data into OTLP format.

  Creates instruments on the global MeterProvider and records values on them.
  """
  for item in data:
    meter = metrics.get_meter_provider().get_meter(item["name"])

    # Create counters, etc. and add values
    # ... lots of code here ...


@functions_framework.cloud_event
def entrypoint(cloud_event: CloudEvent) -> flask.typing.ResponseReturnValue:
    exporter_otlp_http = OTLPMetricExporter(endpoint="http://1.2.3.4/v1/metrics")
    reader_otlp_http = PeriodicExportingMetricReader(
        exporter=exporter_otlp_http,
        export_interval_millis=math.inf,
        export_timeout_millis=5000,
    )
    reader_otlp_inmemory = InMemoryMetricReader()
    readers = [reader_otlp_http, reader_otlp_inmemory]
    resource = Resource.create({
        "service.name": "apigee-metrics-exporter",
        "export_id": id
    })
    provider = MeterProvider(metric_readers=readers, resource=resource)
    metrics.set_meter_provider(provider)

    data = get_metrics_from_apigee()
    transform_to_otlp(data)

    metrics_data = json.loads(reader_otlp_inmemory.get_metrics_data().to_json())
    generate_report(metrics_data)


    if metrics.get_meter_provider().force_flush():
        logging.info("Exported metrics to otel successfully.")

This is more or less the code I use to transform the data into OTLP format.
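For debugging a script like this, the internal log messages of the SDK and the exporter can be made visible by raising the level of the `opentelemetry` logger hierarchy. The following is a minimal sketch using only the standard library. The helper name `enable_otel_debug_logging` and the log format are illustrative choices, and it assumes the OpenTelemetry packages name their loggers after their modules (`opentelemetry.*`).

```python
import logging
import sys


def enable_otel_debug_logging(level: int = logging.DEBUG) -> logging.Logger:
    """Send records from the `opentelemetry.*` loggers to stderr (hypothetical helper)."""
    otel_logger = logging.getLogger("opentelemetry")
    otel_logger.setLevel(level)

    # Add the handler only once, so repeated invocations in a warm
    # function instance do not produce duplicate log lines.
    already_added = any(
        getattr(handler, "_otel_debug", False) for handler in otel_logger.handlers
    )
    if not already_added:
        handler = logging.StreamHandler(sys.stderr)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        handler._otel_debug = True  # marker attribute used by the check above
        otel_logger.addHandler(handler)
    return otel_logger
```

Calling `enable_otel_debug_logging()` at the start of `entrypoint` would apply to every child logger that has no explicit level of its own.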

Expected Result

During local testing I see no errors or exceptions; everything works as expected. After deploying to Google Cloud Functions (v2) and triggering the function, the code ran and raised an exception that names the InMemoryMetricReader. The exception is shown below.

Actual Result

The error I receive is as follows:

...
WARNING 2024-09-18T14:54:02.910493Z Transient error Internal Server Error encountered while exporting metric batch, retrying in 1s.
WARNING 2024-09-18T14:54:03.997012Z Transient error Internal Server Error encountered while exporting metric batch, retrying in 2s.
WARNING 2024-09-18T14:54:06.108241Z Transient error Internal Server Error encountered while exporting metric batch, retrying in 4s.
WARNING 2024-09-18T14:54:10.199428Z Transient error Internal Server Error encountered while exporting metric batch, retrying in 8s.
WARNING 2024-09-18T14:54:18.272264Z Transient error Internal Server Error encountered while exporting metric batch, retrying in 16s.
WARNING 2024-09-18T14:54:34.367010Z Transient error Internal Server Error encountered while exporting metric batch, retrying in 32s.
ERROR 2024-09-18T14:55:06.372312Z Metrics could not be exported
Traceback (most recent call last):
  File "/workspace/main.py", line 490, in entrypoint
    if metrics.get_meter_provider().force_flush():
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/layers/google.python.pip/pip/lib/python3.12/site-packages/opentelemetry/sdk/metrics/_internal/__init__.py", line 457, in force_flush
    raise Exception(
Exception: MeterProvider.force_flush failed because the following metric readers failed during collect: InMemoryMetricReader: MetricsTimeoutError('Timed out while flushing metric readers')
...

I can see that the metric reader is running into a timeout, but I am missing the reason. What is the metric reader waiting for? Is there a way to get more debug information to resolve this? As I said, it works on my local machine but not in Cloud Functions, and I have no clue what changed. Do you have any ideas how I can get more information about why it times out? One more thing I can say is that the PeriodicExportingMetricReader is sending the metrics data to my OpenTelemetry Collector instance normally.
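One possible reading of the log, offered as an assumption about SDK internals rather than a confirmed diagnosis: the retry delays add up to 1 + 2 + 4 + 8 + 16 + 32 = 63 s, which matches the gap between the first warning (14:54:02) and the error (14:55:06). If `MeterProvider.force_flush` flushes the readers one after another against a single shared deadline, a reader that blocks on retries could use up the whole budget, and the next reader in the list would then be reported as timed out without having waited for anything itself. The sketch below simulates that behaviour with stand-in classes. `SlowReader`, `FastReader`, and `flush_all` are hypothetical names and do not call the OpenTelemetry SDK.

```python
import time


class SlowReader:
    """Stand-in for a reader whose exporter keeps retrying (hypothetical)."""

    def __init__(self, block_seconds: float) -> None:
        self.block_seconds = block_seconds

    def force_flush(self, timeout_millis: float) -> None:
        # Assumption: the retry loop does not honour the budget it was given.
        time.sleep(self.block_seconds)


class FastReader:
    """Stand-in for an in-memory reader that returns immediately (hypothetical)."""

    def force_flush(self, timeout_millis: float) -> None:
        pass


def flush_all(readers, timeout_millis: float) -> dict:
    """Flush readers sequentially against one shared deadline.

    Returns a mapping of reader class name to error message for every
    reader that was reached only after the deadline had passed.
    """
    deadline = time.monotonic() + timeout_millis / 1000
    errors = {}
    for reader in readers:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            # The reader is blamed even though it never started flushing.
            errors[type(reader).__name__] = "Timed out while flushing metric readers"
            continue
        reader.force_flush(timeout_millis=remaining * 1000)
    return errors


if __name__ == "__main__":
    # Slow reader first, as in readers = [reader_otlp_http, reader_otlp_inmemory].
    print(flush_all([SlowReader(0.2), FastReader()], timeout_millis=50))
```

In this simulation the error is attributed to `FastReader` only because it comes second in the list. If the real SDK behaves the same way, the collector returning Internal Server Error in the deployed environment would be the thing to investigate, and the InMemoryMetricReader message would be a side effect of the reader order.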

Additional context

No response

Would you like to implement a fix?

None

@patricklubach patricklubach added the bug Something isn't working label Sep 18, 2024