This complements #3759, which tracks production readiness for dapp developers and helps validators pick "market" prices for different resources.
This issue tracks the need to have the right data streams/processes in place so that we can monitor calibration accuracy:
We have ways to perform calibration based on synthetic data (i.e. "fancy tests") in the host crate.
We need something to help us determine how close we are to our target models when we compute cpuInstruction.
We need to be able to detect when we're "very wrong", both where we overestimate (in which case we're missing out on capacity) and where we underestimate (a possible DoS attack vector).
There are a couple of ways we can try to measure this:
cpuInstructionCount/executionTime
pros: easiest to measure, "true" (in that cpuInstruction is a proxy for execution time)
cons: metric depends on runtime environment
cpuInstructionCount/actualCpuInstructionCount
pros: gives direct feedback on models
cons: architecture-dependent (x86), probably requires high privileges, does not catch issues related to memory access (like the impact of CPU caches, etc.)
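The first ratio above, and the "very wrong" detection it enables, can be sketched as follows. This is a minimal illustration, not stellar-core code: the function names, the nominal instructions-per-nanosecond rate, and the tolerance band are all made-up values for the example.

```rust
// Ratio of budgeted CPU instructions to observed wall-clock time (ns).
// Guards against division by zero with a floor of 1 ns.
fn calibration_ratio(cpu_instruction_count: u64, execution_time_ns: u64) -> f64 {
    cpu_instruction_count as f64 / execution_time_ns.max(1) as f64
}

// Flag deviations beyond a multiplicative tolerance band around a nominal
// rate. Ratio far above nominal: we overestimated cost (wasted capacity);
// ratio far below nominal: we underestimated cost (potential DoS vector).
fn is_very_wrong(ratio: f64, nominal: f64, tolerance: f64) -> bool {
    ratio > nominal * tolerance || ratio < nominal / tolerance
}
```

With `nominal = 10.0` and `tolerance = 2.0`, any observed ratio outside `[5.0, 20.0]` would be flagged for investigation.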
I think we could do a bit of both (in order of impact, descending):
track cpuInstructionCount/executionTime at the transaction and ledger level, exported as a medida metric (this would allow us to catch larger trends)
hook up Tracy to track cpuInstructionCount/executionTime as close as possible to components (this would allow us to quickly identify model issues)
hook up Tracy to track cpuInstructionCount/actualCpuInstructionCount (maybe with a special build flavor?), which we can use when running under controlled environments (to match the calibration environment)
We can then use this instrumentation both as part of tracking node health and when replaying historical data (catchup); the latter can also be used as part of acceptance criteria when validating builds.
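An acceptance check over a replay run could look something like the sketch below. This is purely illustrative: the function name, the sample shape (per-ledger `(cpuInstructionCount, executionTimeNs)` pairs), and the thresholds are assumptions, not an actual stellar-core interface.

```rust
// Given per-ledger (instruction count, execution time in ns) samples from a
// catchup/replay run, verify the aggregate instructions-to-time ratio stays
// within a multiplicative tolerance band around the calibration target.
fn replay_acceptance(samples: &[(u64, u64)], nominal: f64, tolerance: f64) -> bool {
    // Sum instructions and nanoseconds across all sampled ledgers.
    let (instr, ns): (u64, u64) = samples
        .iter()
        .fold((0, 0), |(i, n), &(a, b)| (i + a, n + b));
    let ratio = instr as f64 / ns.max(1) as f64;
    ratio <= nominal * tolerance && ratio >= nominal / tolerance
}
```

Aggregating before dividing weights each ledger by its actual runtime, so a few tiny ledgers with noisy ratios don't dominate the verdict.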
In discussion with @jayz22 today, he noted that the instructions-to-real-time ratio we observe on the dashboard drops when there's a lot of data moving through the system. This might be because the XDR serialization and deserialization that happens on the Rust side of the bridge (in contract.rs) isn't accounted for in the block of code that has a budget active, but it is currently accounted for by the real-time clock (a medida TimeScope object, on the C++ side).
This is relatively easy to fix:
use Rust's std::time::Instant::now() function to track the narrower time scope on the Rust side
plumb the resulting difference of times, as a u64 nanosecond duration, back to C++
feed that duration into the Medida::Timer directly, rather than using a TimeScope
This should improve the accuracy of the time-to-instructions ratio.
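The steps above can be sketched on the Rust side roughly as follows. Everything except `std::time::Instant` is illustrative: `do_host_work` stands in for the XDR (de)serialization plus host invocation in contract.rs, and `timed_invoke` stands in for the FFI entry point whose returned nanoseconds the C++ caller would feed into its Medida::Timer instead of using a TimeScope.

```rust
use std::time::Instant;

// Placeholder for the budgeted work done on the Rust side of the bridge
// (XDR decode, host function invocation, XDR encode).
fn do_host_work() -> u64 {
    (0..1_000u64).sum()
}

// Time the narrow scope around the Rust-side work and hand the elapsed
// duration back as u64 nanoseconds, alongside the work's result.
fn timed_invoke() -> (u64, u64) {
    let start = Instant::now();
    let result = do_host_work();
    let nanos = start.elapsed().as_nanos() as u64;
    (result, nanos)
}
```

Returning a plain `u64` keeps the FFI surface trivial; the C++ side can then record it on the existing timer without any change to the metric's identity.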
We have metrics and a dashboard tracking cpuInstructionCount/executionTime. #3847 also adds an "invoke time" metric (which is more directly related to cpuInstructionCount than the "operation time"), which should fix the divergence mentioned above. The dashboard needs to be updated to reflect it once the change goes live.
Closing this for now as there are no actionable items. Feel free to reopen in the future if more "advanced" measurements become necessary.