[Feature]: Highly-available & fault-tolerant validators #17186

outofforest commented Jul 29, 2023
Highly-available & fault-tolerant validators

This document describes my findings on a possible implementation of highly available, fault-tolerant validators.

The purpose of this issue is to discuss the topic with the Cosmos SDK team and other interested people, check whether this functionality is desired (I believe it is!), provide more details for unclear parts, and discover possible dependencies in the code that I missed. Feel free to comment and give suggestions.

Once the discussion is finished, if I see that implementing this is possible, I will invest some time to do it.

Motivation

Let's say you run a validator, so you have a server or virtual machine where the validator runs.
Now you need a disaster recovery plan for when the server goes down or needs maintenance, because you don't want to be slashed.

For this, you need at least two servers and a procedure to switch the validator from one server to the other. It might be done
manually or automatically, using some heartbeat software.

The problem is that no matter which option you choose, currently the same validator private key must be moved between,
or coexist on, both machines. This leads to some problems:

  • for security reasons, a private key should never, ever leave the machine it was generated on
  • if a mistake is made, by a human or by software, and both copies of the validator are active at the same time, the validator is slashed and tombstoned.

We have all seen cases where even professional companies had their validator tombstoned because of such mistakes.

The root of the issue is that Cosmos SDK was never designed with these scenarios in mind.
So I started thinking about how it could be fixed, and I've found a possible solution.

If we assume that a validator's private key may never leave the machine it was generated on,
the obvious conclusion is that each server (main and backup) must hold its own private key.
But then a problem arises, because a validator in Cosmos SDK may sign blocks and proposals with only a single private key.
This led me to the conclusion that it is this single-key assumption that needs to be broken.

After thinking more about it, I arrived at the following framework:

  • each machine has its own private key, which is never shared with any other machine
  • a validator is defined on chain by providing all the corresponding public keys (each key representing a single HA node) - so in this design each validator has many keys assigned to it, not just one
  • only one public key (HA node) may be active at a time - only the server holding the private key corresponding to the active public key may sign and propose blocks; the other instances are not considered validators
  • if the active server goes down, the active public key may be switched to another one configured for that validator, so another machine starts signing on behalf of that validator.

I see three possible triggers for switching the public key:

  • manual - when the staff plans to turn off the active server; it might be done personally by the operator, or someone else may be granted permission using authz to do it on the operator's behalf
  • automatic, off-chain - by any software developed by the operator to monitor the servers; as in the case above, by issuing a transaction signed by a permitted private key
  • automatic, on-chain - by the chain itself; whenever a validator misses proposing a block on its turn, the public key could be rotated automatically by the consensus protocol

This functionality could be implemented as an extension to the current staking module, eliminating a huge problem validator operators experience when maintaining their validators.

Mechanics of the HA node switching

On the CometBFT side things are simple. There is a set of validators, each represented by a public key and a voting power.
At the moment there is a 1-to-1 relationship between a validator in CometBFT and a validator in Cosmos SDK.

By implementing this proposal, I want one CometBFT validator to be represented by n possible nodes (public keys) in Cosmos SDK,
grouped under a common operator address. At any time, exactly one public key in Cosmos SDK is active for each operator,
so the 1-to-1 relationship between CometBFT and Cosmos SDK is still maintained. The only difference is that the set
of CometBFT validators becomes "more dynamic".

In practice, this means that whenever the active public key of a Cosmos SDK validator changes, it must replace the old one in CometBFT.
This is done by issuing validator updates in the end blocker of the staking module, providing the new active public key with the same voting power.
As a result, CometBFT "knows" only the active public keys constituting the set of active validators.

Terminology

The problem with terminology is that the word validator may now have several meanings:

  • a member of the validator set in CometBFT
  • the validator defined by the operator in Cosmos SDK, grouping many fault-tolerant, highly available servers
  • the server running the blockchain node

In the spec below I use HA node for the third meaning. Suggestions for good terms for the first and second meanings are welcome.

End blocker

As mentioned earlier, whenever the set of active HA nodes changes, the staking module must prepare a set
of validator updates to be passed to CometBFT.
The old HA node must be removed by setting the voting power of its public key to 0, and the new active one must be added
by setting the voting power of its public key.
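
As a minimal sketch, the update pair could be built like this (haSwitchUpdates is a hypothetical helper, not existing SDK code; abci.ValidatorUpdate is the type CometBFT already consumes):

    package keeper

    import (
        abci "github.com/cometbft/cometbft/abci/types"
        cmtcrypto "github.com/cometbft/cometbft/proto/tendermint/crypto"
    )

    // haSwitchUpdates builds the validator updates emitted by the end blocker
    // when the active HA node changes: the old key is removed by setting its
    // power to 0, and the new key takes over with the unchanged voting power.
    func haSwitchUpdates(oldKey, newKey cmtcrypto.PublicKey, power int64) []abci.ValidatorUpdate {
        return []abci.ValidatorUpdate{
            {PubKey: oldKey, Power: 0},     // deactivate the previous HA node
            {PubKey: newKey, Power: power}, // activate the new HA node
        }
    }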

Create validator tx

When a validator is created, it is identified by the operator address, which may be treated as the validator's ID.
Cosmos SDK already enforces that each operator may run only one validator, so this ID is already unique.

When a validator is created, its public key is passed as an independent field, meaning we may create many HA nodes,
each using a different public key.

func (k msgServer) CreateValidator(ctx context.Context, msg *types.MsgCreateValidator) (*types.MsgCreateValidatorResponse, error)

There is a check verifying that the public key is not used by any other validator. The same must be done to check that a public key
is unique across all HA nodes, as in the sketch below.
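
A sketch of that extended check, assuming a hypothetical keeper index of HA node keys by consensus address (GetHANodeByConsAddr does not exist in the SDK yet):

    package keeper

    import (
        "context"

        cryptotypes "github.com/cosmos/cosmos-sdk/crypto/types"
        sdk "github.com/cosmos/cosmos-sdk/types"
        "github.com/cosmos/cosmos-sdk/x/staking/types"
    )

    // checkHANodeKeyUnique rejects a public key that is already registered
    // for any HA node of any validator. GetHANodeByConsAddr is an assumed
    // index keyed by consensus address.
    func (k Keeper) checkHANodeKeyUnique(ctx context.Context, pk cryptotypes.PubKey) error {
        consAddr := sdk.ConsAddress(pk.Address())
        if _, found := k.GetHANodeByConsAddr(ctx, consAddr); found {
            return types.ErrValidatorPubKeyExists
        }
        return nil
    }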

Relations to other modules

At the end, the AfterValidatorCreated hook is called. The slashing and distribution modules subscribe to this hook:

  • distribution: fields related to rewards and commissions are initialized. All the operations there use only the
    operator address, so my changes don't affect the logic; nothing needs to be modified
  • slashing: a consensus address -> public key mapping is stored by the hook. That mapping is used only by
    the evidence module to check that the consensus address reported in the evidence exists in the system. I believe
    this is not strictly needed. In any case, the consensus address is derived from the public key, so it is a 1-to-1
    relationship for each HA node, meaning I may simply add the mapping for each node.

To maintain that mapping inside the slashing module, I need two new hooks (see the sketch below):

  • HA node created
  • HA node deleted
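
These could mirror the existing staking hooks; the names below are placeholders:

    package types

    import (
        "context"

        sdk "github.com/cosmos/cosmos-sdk/types"
    )

    // HANodeHooks is a hypothetical extension of the staking hooks. The
    // slashing module would implement it to keep its consensus address ->
    // public key mapping in sync for every HA node of a validator.
    type HANodeHooks interface {
        AfterHANodeCreated(ctx context.Context, valAddr sdk.ValAddress, consAddr sdk.ConsAddress) error
        AfterHANodeRemoved(ctx context.Context, valAddr sdk.ValAddress, consAddr sdk.ConsAddress) error
    }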

Managing relations between validator and its HA nodes

The proto of the staking module defines the Validator message containing the consensus_pubkey field, which stores the public key of the validator.
It must be converted into a slice to store the public keys of all the HA nodes.
There is a ConsPubKey() (cryptotypes.PubKey, error) method used in a couple of places to get that key. As there is no single
key for the validator anymore, each call site must be converted to one of these options:

  • the call might simply be removed if not really needed
  • return all the public keys of all HA nodes
  • return the single public key matching a provided consensus address

It hasn't been identified yet which option fits the purpose of each call.

The Validator message should also be extended with an active_consensus_pubkey field indicating which HA node is currently active.
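
Expressed on the generated Go type, the change could look roughly like this (field names are tentative; the actual change would be made in the proto definition and regenerated):

    package types

    import codectypes "github.com/cosmos/cosmos-sdk/codec/types"

    // Sketch of the modified staking Validator type.
    type Validator struct {
        OperatorAddress string
        // ConsensusPubkeys replaces the former single ConsensusPubkey field:
        // one public key per HA node of this validator.
        ConsensusPubkeys []*codectypes.Any
        // ActiveConsensusPubkey indicates which HA node currently signs.
        ActiveConsensusPubkey *codectypes.Any
        // ... remaining fields (Jailed, Status, Tokens, ...) stay unchanged
    }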

There is a ValidatorSigningInfo map, mapping a consensus address to some metrics and information.
That structure contains fields related to the validator itself (not to a particular HA node), except for the consensus address itself.
The Address field is not used anywhere, so it might simply be removed. The operator address should then be used
as the key of that map, because it is the value that uniquely identifies the validator, not the consensus address.
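
For illustration, the store key for signing info would then be derived from the operator address rather than the consensus address (the prefix byte here is an assumption, not the real x/slashing prefix):

    package keeper

    import (
        sdk "github.com/cosmos/cosmos-sdk/types"
        "github.com/cosmos/cosmos-sdk/types/address"
    )

    // signingInfoKey is a hypothetical replacement key for the signing-info
    // store in x/slashing: keyed by operator address, the metrics survive
    // HA node switches.
    func signingInfoKey(operator sdk.ValAddress) []byte {
        return append([]byte{0x01}, address.MustLengthPrefix(operator)...)
    }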

HA node states and active node switching

When a validator is created, the provided public key constitutes its first HA node, which is automatically set to the active state.
There are 3 possible states for an HA node:

  • active - this HA node signs and proposes blocks; only one HA node per validator may be in this state
  • enabled - an HA node in this state does not sign anything until it is set to active
  • disabled - an HA node in this state does not sign anything and cannot be set to active; it must be set to enabled first
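
The state machine above could be sketched as follows (all names are hypothetical):

    package types

    // HANodeState enumerates the proposed HA node states.
    type HANodeState int32

    const (
        HANodeDisabled HANodeState = iota // cannot be activated directly
        HANodeEnabled                     // idle, but eligible for activation
        HANodeActive                      // signs and proposes blocks
    )

    // canActivate enforces the rule that only an enabled HA node may become
    // active; a disabled node must be enabled first.
    func canActivate(s HANodeState) bool {
        return s == HANodeEnabled
    }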

The difference between enabled and disabled is that the operator may grant someone else (using authz) permission
to change the active HA node (move it from enabled to active), while at the same time deciding that there
are some HA nodes (the disabled ones) which cannot be activated, e.g. because those servers are under maintenance or intentionally turned off.

This means that a hypothetical heartbeat application might exist, monitoring the status of the servers and switching the active
HA node automatically if the current one is dead. The application should use its own private key (not the one belonging to the operator),
and that private key should be permitted (with authz) to broadcast the transaction selecting the active HA node from the set of enabled ones.
At the same time, this private key should not be allowed to enable a disabled HA node.
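
As a fragment-level sketch, assuming the new MsgSelectActiveHANode message described under "New transactions" below, the wiring could reuse the existing x/authz primitives (all HA-specific names are hypothetical, and the exact NewMsgGrant signature depends on the SDK version):

    // Operator side: grant the heartbeat key the right to select the active
    // HA node only - no grant is given for enabling or disabling nodes.
    auth := authz.NewGenericAuthorization(sdk.MsgTypeURL(&types.MsgSelectActiveHANode{}))
    grantMsg, err := authz.NewMsgGrant(operatorAddr, heartbeatAddr, auth, &expiration)

    // Heartbeat side: sign with the heartbeat key and execute the switch on
    // behalf of the operator.
    execMsg := authz.NewMsgExec(heartbeatAddr, []sdk.Msg{
        &types.MsgSelectActiveHANode{
            ValidatorAddress: valAddr.String(),
            ConsensusPubkey:  newActiveKey, // must reference an *enabled* HA node
        },
    })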

New transactions

New transactions need to be added to the staking module for:

  • adding an HA node
  • deleting an HA node
  • enabling an HA node
  • disabling an HA node
  • selecting the active HA node

The CreateValidator tx must be modified accordingly to immediately create and activate the first HA node for the validator.
It looks like the structure of the message does not need to be changed. A sketch of the new message set follows.
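
All names and fields below are tentative; in practice these messages would be defined in proto and generated:

    package types

    import codectypes "github.com/cosmos/cosmos-sdk/codec/types"

    // MsgAddHANode registers a new HA node (public key) under a validator.
    type MsgAddHANode struct {
        ValidatorAddress string          // operator address identifying the validator
        ConsensusPubkey  *codectypes.Any // public key of the HA node being added
    }

    // MsgDeleteHANode, MsgEnableHANode, MsgDisableHANode and
    // MsgSelectActiveHANode would carry the same two fields, differing only
    // in the state transition they request.
    type MsgSelectActiveHANode struct {
        ValidatorAddress string
        ConsensusPubkey  *codectypes.Any // must reference an *enabled* HA node
    }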

New queries

  • querying the statuses of all HA nodes of a validator
  • querying the active HA node
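
Both could be sketched as a gRPC query service extension (request/response messages would be generated from proto; all names are tentative):

    // Hypothetical additions to the staking query service.
    type HAQueryServer interface {
        // HANodes returns all HA nodes of a validator together with their
        // states (active / enabled / disabled).
        HANodes(ctx context.Context, req *QueryHANodesRequest) (*QueryHANodesResponse, error)
        // ActiveHANode returns the HA node currently signing for the validator.
        ActiveHANode(ctx context.Context, req *QueryActiveHANodeRequest) (*QueryActiveHANodeResponse, error)
    }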

To do in next steps

The next step after implementing this proposal would be adding an option for the chain itself to switch the active HA node
automatically if the currently active one misses its opportunity to propose a block. This failover mechanism would eliminate the need
for the heartbeat application described above, because its role would be taken over by the chain.

This issue was locked and converted into discussion #17189 on Jul 29, 2023.