Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix broken metrics part 1 #31

Merged
merged 81 commits into from
Oct 25, 2023
Merged

Fix broken metrics part 1 #31

merged 81 commits into from
Oct 25, 2023

Conversation

ximon18
Copy link
Member

@ximon18 ximon18 commented Oct 10, 2023

This PR replaces accidentally closed PR #27.

There will be a "part 2" soon, but for now this PR is stopping here.


Add unit or integration tests (as relevant) for all metrics showing that they update as expected.

Status:

  • BgpTcpInMetrics: ✅ (incomplete)
  • BmpInMetrics: ✅
  • BmpMetrics: TODO
  • BmpProxyMetrics: TODO
  • BmpTcpInMetrics: TODO
  • GateMetrics: TODO
  • MqttMetrics: TODO
  • RibUnitMetrics: TODO
  • RibMergeUpdateStatistics: TODO
  • RotoFilterMetrics: ✅
  • TokioTaskMetrics: TODO
  • TrackingAllocator: ✅

This PR fixes the following issues:

  • Fixes an issue where metrics for a router ID may not exist and cause a HTTP 404 Not Found error in the HTML UI.
  • Fixes an issue where a BMP re-initiate with a new sysName / sysDesc may not be detected.
  • Fixes an issue where a non-fatal I/O error would wrongly terminate a BMP input stream.
  • Fixes a typo in metric name bgp_tcp_in_diconnect_count.

And this PR also:

  • Introduces Target::Test for obtaining metric snapshots in a form more conducive to testing (only compiled in test mode, not production mode).
  • Introduces fn AnyStatusReporter::metrics() for easier access to metrics for testing.
  • Extends Filterable::filter() to take a filtered_fn(SourceId) callback to make it easier to know when payloads are filtered out, especially when filtering on a collection of payloads rather than a single payload.
  • Marks metrics with a // TEST STATUS: [ ] makes sense? [ ] passes tests? style comment to track which metrics have been assessed as to whether they actually make sense or not and whether or not they are covered by tests that pass.
  • Where needed code has been refactored to make it testable.

13/78 TEST STATUS comments are "green", so 65/78 remain to do, although not all are equally important.

…, fix typos and make various improvements to the config file text and structure.
When run without --config the embedded copy of etc/rotonda.conf is used instead, and targets.proxy is removed if --proxy-destination is not supplied on the command line.

If --bgp-listen or --bmp-listen are supplied on the command line they override the BMP and BGP listen settings in the embedded config.

- Allow BGP peers to be missing.
  - Add missing semi-colon.
  - Filter out not in.
…age can link to the actual RIB query HTTP API rather than incorrectly assume a fixed path of /prefixes/.
…ined in roto scripts loaded from a roto script directory.

- Refactors roto script file reading from the units/targets to the Manager.

- Checks script loading, compilation and valid filter names at config load time instead of at filter execution time.

- Add MVP behaviour tests (WIP).

- Introduces a FilterName type.
…e required .roto files, e.g. if running from `cargo install`. Cleanup config initialisation and standard variable naming.
…ts_dir` dir `etc` doesn't exist.

This change had several knock on effects:
- Introduced RotoError::LoadError.
- Upgrade clap crate to fix bug with clap switcing help on own line or not even when disabled.
- Upgrade toml crate to fix inconsistent line breaks in diagnostic dump of post-processed config TOML.
- Adjust position handling in `Marked` due to changes in serde_spanned pulled in by other crate upgrades.
- Added new cmd line arg --print-config-and-exit and introduced `Terminated` for use in cases where `ExitError` wrongly limits us to just an error exit rather than a normal exit.
- Various minor tweaks to the clap config for improved read and usability.
- Log when exiting.
@ximon18 ximon18 changed the base branch from blocks-wip to fix-broken-bmp-routers-html-links October 12, 2023 07:45
Base automatically changed from fix-broken-bmp-routers-html-links to blocks-wip October 12, 2023 07:55
@ximon18 ximon18 changed the title Fix broken metrics Fix broken metrics part 1 Oct 12, 2023
@ximon18 ximon18 marked this pull request as ready for review October 12, 2023 13:15
@ximon18 ximon18 merged commit e250a6d into blocks-wip Oct 25, 2023
1 check failed
@ximon18 ximon18 deleted the fix-broken-metrics branch October 25, 2023 20:10
density215 pushed a commit that referenced this pull request Dec 8, 2023
* Add a special metric output format for use by unit tests, and a callback fired on each VM exec that rejects the input (to enable easy counting of filtered out input messages), and remove support for metric types we don't need. Use the new callback to invoke the till now unused message_filtered() status reporter fn.

* Extend filter unit tests to show that the filtered message count metric now works.

* Remove the bmp_in_connection_count metric as connections are handled by bmp_tcp_in while this metric is based on BMP initiation messages received, so this is (a) misleading and (b) doesn't work anyway as when the sys name changes so can the router id and thus the changed metric would be in a different metric set, there'd be no effect on the previous metric.

* Follow changes in router id caused by a new sysName received by a subsequent BMP Initiation Message.

* Make test metrics easier to query, and add initial tests for the one and only remaining bmp-in metric. TODO: add a test showing bmp_in_num_invalid_bmp_messages increasing.

* Remove the footgun of having to initialiize metrics per router, just ensure they are initialized on first use, otherwise metrics get lost.

* Add a test showing that invalid BMP messages cause the bmp_in_num_invalid_bmp_messages counter to increase.

* Test the custom allocator metrics.

* Refactor bgp_tcp_in to be testable and add first metric test.

* Sort test metric output for easier reading when inspecting the contents manually during development.

* Metric name typo correction.

* Mock the TcpStream as well so that we can accept a simulated connection to test the `bgp_tcp_in_connection_accepted_count` metric.

* Also test the connection lost and disconnected bgp-tcp-in metrics by using a mock BGP `Session`.

* Improved naming of RIB merge update statistic metrics.

* Introduce the concept of fatal vs non-fatal I/O errors and don't drop the BMP TCP input receiver on non-fatal errors (otherwise loss of the receiver causes the router read loop to abort anyway).

* FIX: Don't abort BMP TCP input reading on non-fatal I/O errors.

* Add a test for some of the bmp-tcp-in metrics.

* Import SeqCst directly to be consistent with other usages and to be less verbose.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants