
Basic concurrency support for transaction prover service #908

Open
igamigo opened this issue Oct 7, 2024 · 3 comments


igamigo commented Oct 7, 2024

The current version of the transaction proving service processes only one transaction at a time. To improve performance and scalability, we want to introduce concurrency while ensuring the system remains resilient against potential DDoS attacks and resource exhaustion.

Basic desired features:

  • Task queue: We want to manage multiple transaction proofs concurrently without overwhelming the system, so we should introduce a basic task queue where incoming valid requests are placed and processed by any available worker.
    • Concurrency control: A first basic approach would be to size the pool of available workers based on system resources; once all workers are busy, additional transactions should be put on hold or rejected.
    • Investigate options for distributing workers across nodes.
  • Rate limiting: We should try to reduce the attack surface for DDoS attacks by rate limiting incoming requests, probably keyed by client IP.
  • Timeouts: I'm unsure if there is any simple way in which an incoming transaction could stall a worker, but there should be timeout mechanisms regardless. API clients could also implement exponential backoff retries.
  • Simple monitoring: It would be nice to implement an endpoint with (initially basic) metrics related to the service health and status.

bobbinth commented Oct 8, 2024

Looks great! A couple of additional comments:

  • We should assume that there is only one worker per machine for now. As a worker comes online, it could ping the coordinator to announce that it is available, and the coordinator would then add it to the pool of available workers.
  • I think we should have a configurable timeout. If proof generation takes longer than this timeout, the coordinator should assume that the worker is dead and remove it from the pool. If the worker is not in fact dead, it should be able to detect this somehow.
  • It would be really nice if the coordinator could somehow request an increase in capacity (i.e., in the number of workers). The default implementation of this could do nothing, but in the future it could be configured to request a new AWS spot instance, for example.

And the last point is that ideally we'd not implement it ourselves but would be able to use an already existing component/framework.
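
The coordinator behavior in the first two bullets (workers ping in, stale ones get pruned after a configurable timeout) could look roughly like the following. All names and the heartbeat shape are assumptions for illustration, not a proposed API.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Minimal coordinator state: worker id -> time of last heartbeat.
struct Coordinator {
    timeout: Duration,                 // configurable heartbeat/proof timeout
    workers: HashMap<String, Instant>, // pool of available workers
}

impl Coordinator {
    fn new(timeout: Duration) -> Self {
        Self { timeout, workers: HashMap::new() }
    }

    /// A worker coming online (or finishing a job) pings the coordinator.
    fn ping(&mut self, worker: &str, now: Instant) {
        self.workers.insert(worker.to_string(), now);
    }

    /// Drop workers whose last ping is older than the timeout and return
    /// their ids; a real implementation might also try to re-contact them.
    fn prune(&mut self, now: Instant) -> Vec<String> {
        let timeout = self.timeout;
        let dead: Vec<String> = self
            .workers
            .iter()
            .filter(|(_, last)| now.duration_since(**last) > timeout)
            .map(|(id, _)| id.clone())
            .collect();
        for id in &dead {
            self.workers.remove(id);
        }
        dead
    }
}

fn main() {
    let start = Instant::now();
    let mut c = Coordinator::new(Duration::from_secs(30));
    c.ping("worker-a", start);
    c.ping("worker-b", start + Duration::from_secs(25));
    // At t=40s, worker-a (last ping t=0) has missed the 30s timeout.
    let dead = c.prune(start + Duration::from_secs(40));
    println!("removed: {dead:?}");
}
```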

@bobbinth bobbinth added this to the v0.7 milestone Oct 8, 2024
@bobbinth bobbinth modified the milestones: v0.7, v0.6 Oct 17, 2024
SantiagoPittella (Collaborator) commented

Hello! I've been researching this issue a bit, in particular looking at Cloudflare's Pingora crates. I think they fit as a solution for all of our problems here.

  • Task queue: We want to manage multiple transaction proofs concurrently without overwhelming the system, so we should introduce a basic task queue where incoming valid requests are placed and processed by any available worker.
    • Concurrency control: A first basic approach would be to define available workers based on system resources and once the workers are not available, additional transaction should be put on hold or rejected.
    • Investigate options for distributing workers across nodes.

This can be tackled with Pingora's LoadBalancer, setting the config to one thread per service. I can put together a PoC of this with a simple "hello world" server shortly.

Related to this is also the creation and destruction of workers. In my first approach, at least, I was thinking of manually running instances of the prover server and adding their endpoints to the load balancer's upstream configuration; the same works for removing workers (remove the server from the list of upstreams, reload, and turn off the prover server). This can benefit from the graceful upgrade mechanism that Pingora supports.
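
The manual add/remove flow described here boils down to round-robin selection over a mutable upstream list. A toy std-only model (the addresses are made up, and Pingora's real LoadBalancer would replace this):

```rust
/// Toy upstream pool: round-robin over a mutable list of worker addresses,
/// standing in for the load balancer's upstream configuration.
struct Upstreams {
    addrs: Vec<String>,
    next: usize,
}

impl Upstreams {
    fn new() -> Self {
        Self { addrs: Vec::new(), next: 0 }
    }

    /// Adding an address is how a new prover comes into rotation.
    fn add(&mut self, addr: &str) {
        self.addrs.push(addr.to_string());
    }

    /// Removing an address is how a worker is drained before shutdown.
    fn remove(&mut self, addr: &str) {
        self.addrs.retain(|a| a != addr);
    }

    /// Pick the next upstream in round-robin order, if any are left.
    fn select(&mut self) -> Option<&str> {
        if self.addrs.is_empty() {
            return None;
        }
        let picked = self.next % self.addrs.len();
        self.next += 1;
        Some(self.addrs[picked].as_str())
    }
}

fn main() {
    let mut pool = Upstreams::new();
    pool.add("10.0.0.1:50051");
    pool.add("10.0.0.2:50051");
    println!("{:?}", pool.select());
    pool.remove("10.0.0.1:50051"); // drain one worker
    println!("{:?}", pool.select());
}
```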

  • Rate limiting: We should try to reduce the attack surface for DDoS attacks by rate limiting incoming requests probably based on incoming request IPs.

For this we can use the rate limiting that the crate offers out of the box, just setting the maximum number of requests per user per second as described here.
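
As a rough illustration of the per-IP limiting idea (this is not Pingora's actual API, just a simple fixed-window counter with made-up limits):

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Fixed-window rate limiter keyed by client IP.
struct RateLimiter {
    max_per_window: u32,
    window: Duration,
    counters: HashMap<String, (Instant, u32)>, // ip -> (window start, count)
}

impl RateLimiter {
    fn new(max_per_window: u32, window: Duration) -> Self {
        Self { max_per_window, window, counters: HashMap::new() }
    }

    /// Returns true if the request from `ip` at time `now` is allowed.
    fn allow(&mut self, ip: &str, now: Instant) -> bool {
        let entry = self.counters.entry(ip.to_string()).or_insert((now, 0));
        if now.duration_since(entry.0) >= self.window {
            *entry = (now, 0); // window expired: reset the counter
        }
        if entry.1 < self.max_per_window {
            entry.1 += 1;
            true
        } else {
            false // over the limit: reject (e.g. respond with HTTP 429)
        }
    }
}

fn main() {
    let t0 = Instant::now();
    let mut rl = RateLimiter::new(2, Duration::from_secs(1));
    for i in 0..3 {
        println!("request {i}: allowed = {}", rl.allow("1.2.3.4", t0));
    }
}
```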

  • Timeouts: I'm unsure if there is any simple way in which an incoming transaction could stall a worker but there should be timeout mechanisms. API clients could also implement exponential backoff retries.

We can also use Pingora's timeouts out of the box for this; they are easily configurable.
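
A std-only sketch of the timeout idea, bounding how long a request waits on a worker (the durations are arbitrary; the real timeout would come from Pingora's configuration):

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Run a job on a worker thread and give up after `timeout`, so one stalled
/// proof cannot hold a request slot forever.
fn prove_with_timeout(work: Duration, timeout: Duration) -> Option<&'static str> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        thread::sleep(work); // stand-in for proof generation
        let _ = tx.send("proof"); // the receiver may have given up already
    });
    rx.recv_timeout(timeout).ok() // None = timed out
}

fn main() {
    let fast = prove_with_timeout(Duration::from_millis(10), Duration::from_millis(500));
    let slow = prove_with_timeout(Duration::from_millis(500), Duration::from_millis(10));
    println!("fast: {fast:?}, slow: {slow:?}");
}
```

Clients seeing a timeout could then retry with exponential backoff, as suggested above.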

  • Simple monitoring: It would be nice to implement an endpoint with (initially basic) metrics related to the service health and status.

It also has a built-in Prometheus server that can be used for that purpose.
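
For a sense of what such an endpoint would serve, here is a minimal, hand-rolled rendering of counters in Prometheus' text exposition format (the metric names are invented; Pingora's built-in server would produce this for us):

```rust
use std::collections::BTreeMap;

/// Render counters in Prometheus' text exposition format; a monitoring
/// endpoint would serve this body at /metrics.
fn render_metrics(counters: &BTreeMap<&str, u64>) -> String {
    let mut out = String::new();
    for (name, value) in counters {
        // One TYPE comment plus one sample line per counter.
        out.push_str(&format!("# TYPE {name} counter\n{name} {value}\n"));
    }
    out
}

fn main() {
    let mut m: BTreeMap<&str, u64> = BTreeMap::new();
    m.insert("proofs_completed_total", 42);
    m.insert("proofs_failed_total", 1);
    print!("{}", render_metrics(&m));
}
```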

If you think that this is OK, I can proceed in the following way:

  1. Creating the proof of concept of the load balancer (I also want to check a couple of things about this).
  2. Adding the real Load Balancer with a couple of proving servers.
  3. Rate limiting & Timeouts.
  4. Metrics.

All of these can be done as separate issues, and if you agree I can start immediately.

bobbinth (Contributor) commented

This sounds great! Let's start with the PoC to see if we hit any roadblocks.

In terms of the setup, I was thinking the load balancer could run on a relatively small machine (e.g., 4 cores or something like that), and the provers would run on big machines (e.g., 64 cores). Is that how you are thinking about it too?
