Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prometheus exporter #1200

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

edomora97
Copy link
Contributor

Add a new command: cmsPrometheusExporter

This will start a new HTTP server that listens on a port (defaults to 8811) for requests to /metrics. At this endpoint many metrics about a contest (or all contests) are exposed for Prometheus to be collected. Those metrics consider both system values (queue status, connected workers, ...) and contest values (submissions, users, questions, ...)

This exporter works by executing simple queries to the database, and obtains the queue information asking Evaluation Service.

The metrics exposed may leak information about the contest, so it's important to secure this endpoint (for example using a reverse proxy, or binding localhost).

With this exporter it's possible to build interactive and real-time dashboard for monitoring the status of the contest. For example, you can find here the dashboard we used at OII 2020, and here the one used at an OIS round.
Those dashboards are built using Grafana, and we configured it also for sending telegram notifications when something wrong was happening (e.g. queue too long, see image below).

image

Add a new command: cmsPrometheusExporter

This will start a new HTTP server that listens on a port (defaults to
8811) for requests to /metrics. At this endpoint many metrics about a
contest (or all contests) are exposed for Prometheus to be collected.
Those metrics consider both system values (queue status, connected
workers, ...) and contest values (submissions, users, questions, ...)

The metrics exposed may leak information about the contest, so it's
important to secure this endpoint (for example using a reverse proxy, or
binding localhost).

With this exporter it's possible to build interactive and real-time
dashboard for monitoring the status of the contest. For example, you can
find here [1] the dashboard we used at OII 2020, and here [2] the one
used at an OIS round.

[1]: https://snapshot.raintank.io/dashboard/snapshot/0J29w0KruEiymy6zV30beX98aRG6njiX?orgId=2

[2]: https://snapshot.raintank.io/dashboard/snapshot/JngCBt71ZOJhiyEUC7yxZhS3guZmZYYi?orgId=2
Copy link
Member

@andreyv andreyv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this looks great! The dashboard examples are nice.

I see that prometheus_client.start_http_server() uses threading, which should be patched by gevent — OK.

I noticed that:

  1. The service crashes on start if CMS (EvaluationService, to be precise) is not running.
  2. The service returns HTTP 500 if CMS/EvaluationService is stopped.

As a status reporting service, it is probably good to make it more robust to the state of other services.

super().__init__()

self.contest_id = args.contest_id
self.export_subs = not args.no_submissions
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest to rename export_subs and collect_subs to export_submissions and collect_submissions, respectively. It's only this field that is shortened, and the full name is more descriptive.

if self.export_users:
yield from self.collect_users(session)

def collect_subs(self, session):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I understand, only the collect() method is the public interface. In this case the other method names should be prefixed with _.

service = PrometheusExporter(args)
REGISTRY.register(service)
start_http_server(args.port, addr=args.host)
print("Started at http://%s:%s/metric" % (args.host, args.port))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Proper logging should be used here.

)
metric.add_metric(["total"], len(status))

connected = len([1 for worker in status.values() if worker["connected"]])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use generator expressions to avoid constructing the full list: sum(1 for … in … if …). Similarly below.

REGISTRY.register(service)
start_http_server(args.port, addr=args.host)
print("Started at http://%s:%s/metric" % (args.host, args.port))
Service.run(service)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not service.run()?

parser = argparse.ArgumentParser(description="Prometheus exporter.")
parser.add_argument(
"--host",
help="IP address to bind",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
help="IP address to bind",
help="IP address to bind to",

@@ -25,3 +25,6 @@ pyyaml>=5.3,<5.4 # http://pyyaml.org/wiki/PyYAML
# Only for printing:
pycups>=1.9,<1.10 # https://pypi.python.org/pypi/pycups
PyPDF2>=1.26,<1.27 # https://github.com/mstamy2/PyPDF2/blob/master/CHANGELOG

# Only for cmsPrometheusExporter
prometheus-client>=0.7,<0.8 # https://pypi.org/project/prometheus-client
Copy link
Member

@andreyv andreyv Dec 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Documentation should also be updated at "Installing CMS and its Python dependencies". We need python3-prometheus-client for Ubuntu and python-prometheus_client for Arch Linux as optional deps.

@@ -0,0 +1,293 @@
#!/usr/bin/env python3
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The script file should have the executable permission.

Comment on lines +286 to +288
REGISTRY.register(service)
start_http_server(args.port, addr=args.host)
print("Started at http://%s:%s/metric" % (args.host, args.port))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code could be moved to a new method in PrometheusExporter. In fact, PrometheusExporter.run() could also be overridden and this new method be called from there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

2 participants