Failure Scenarios and Precautions

odorfer edited this page Feb 10, 2021 · 12 revisions

This page briefly outlines some possible failure scenarios for this project. It is not a complete list of all theoretically possible failures.

Failure Scenarios

Application: CockroachDB

Description: An instance of CockroachDB fails.

Precaution: CockroachDB clusters are fault-tolerant by design and are able to automatically recover from failures (https://www.cockroachlabs.com/docs/v20.2/disaster-recovery.html).

Status of Precaution: Done

Application: Grafana

Description: The server that runs the Grafana instance fails, or the Grafana process itself fails.

Precaution: Provide multiple redundant servers with a running Grafana instance.

Status of Precaution: Not Planned

Feasibility: Feasible

Application: Docker

Description: Docker fails on a given virtual machine.

Precaution: If Docker fails on a given virtual machine, the CockroachDB instance that should run inside the container is unavailable. The CockroachDB instance must therefore be redeployed on a new virtual machine. If the failed instance comes back online, the new instance must be destroyed immediately.

Status of Precaution: ?

Feasibility: ?

Application: node_exporter

Description: node_exporter fails on a running virtual machine.

Precaution: The Prometheus instance that collects the metrics exported by node_exporter can detect when a node_exporter is unreachable and trigger a script that deletes the virtual machine and redeploys it.

Status of Precaution: Not Planned

Feasibility: Feasible
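
The detection step of the node_exporter precaution above can be sketched against the Prometheus HTTP API. The example below is a minimal Python sketch, not this project's implementation. It assumes a scrape job named `node` (a hypothetical name) and uses an instant query for `up == 0`, which returns the targets whose last scrape failed. The redeployment script itself is not shown.

```python
import json
import urllib.parse
import urllib.request
from typing import List

# Assumed job name for the node_exporter targets; adjust to the real scrape config.
DOWN_QUERY = 'up{job="node"} == 0'


def parse_down_instances(response_body: str) -> List[str]:
    """Return the `instance` labels found in an instant-query response."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    results = payload["data"]["result"]
    return sorted(r["metric"].get("instance", "<unknown>") for r in results)


def fetch_down_instances(prometheus_url: str) -> List[str]:
    """Run the query against /api/v1/query on the given Prometheus base URL."""
    query = urllib.parse.urlencode({"query": DOWN_QUERY})
    url = f"{prometheus_url}/api/v1/query?{query}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return parse_down_instances(resp.read().decode("utf-8"))
```

Keeping the parsing separate from the HTTP call makes the logic testable without a running Prometheus server.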

Application: Consul

Description: One or more Consul instances fail.

Precaution: Implement a recovery strategy according to the official documentation for outage recovery (https://learn.hashicorp.com/tutorials/consul/recovery-outage).

Status of Precaution: Not Planned

Feasibility: ?

Virtual Machine

Description: The virtual machine running an instance of CockroachDB fails.

Precaution: Each virtual machine exposes monitoring data through the Prometheus Node Exporter (https://prometheus.io/docs/guides/node-exporter/). The central Prometheus instance pulls the data from each virtual machine. If a virtual machine does not respond, an alert is triggered, which in turn triggers a redeployment of that instance. The old instance is deleted with the help of Terraform and replaced by a new one.

Status of Precaution: Planned

Feasibility until project end: Feasible
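
The alert-to-redeployment step of the virtual machine precaution above could look roughly like the following Python sketch. It assumes that alerts are delivered as an Alertmanager webhook payload and that Terraform is recent enough to support the `-replace` option of `terraform apply`. The mapping from `instance` labels to Terraform resource addresses and the address `module.cluster.example_vm.node[0]` are hypothetical. The commands are only built here, not executed.

```python
import json
from typing import Dict, List

# Hypothetical mapping from Prometheus `instance` labels to Terraform
# resource addresses; a real setup would derive this from its own state.
INSTANCE_TO_RESOURCE: Dict[str, str] = {
    "10.0.0.1:9100": "module.cluster.example_vm.node[0]",
}


def firing_instances(webhook_body: str) -> List[str]:
    """Extract the `instance` label of every firing alert in the payload."""
    payload = json.loads(webhook_body)
    return [
        alert["labels"]["instance"]
        for alert in payload.get("alerts", [])
        if alert.get("status") == "firing" and "instance" in alert.get("labels", {})
    ]


def replace_commands(webhook_body: str) -> List[List[str]]:
    """Build one `terraform apply -replace=...` command per known instance."""
    commands = []
    for instance in firing_instances(webhook_body):
        address = INSTANCE_TO_RESOURCE.get(instance)
        if address is None:
            continue  # Unknown instance: leave the decision to a human.
        commands.append(
            ["terraform", "apply", f"-replace={address}", "-auto-approve"]
        )
    return commands
```

Each returned command could be passed to `subprocess.run`. Because `-auto-approve` skips the confirmation prompt, a real deployment would probably want rate limiting or a manual approval step before running it.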

Network Fault / Increased Response Time

Description: The network connection to a cloud provider is unavailable or overloaded (for example, a whole cloud provider fails or is unreachable).

Precaution: A DNS-based failover approach could be used. However, plain DNS simply selects the next server in the DNS entry in round-robin fashion without checking whether it is available. This approach would therefore need error detection that checks whether the resources of each cloud provider are available and, in the event of an error, removes the public IP of the network hosted by that cloud provider from the DNS entry. Another problem with this approach is the caching of DNS entries at all levels. This can be mitigated by making the TTL as short as possible. Commercial solutions of this kind exist, for example Amazon Route 53 (https://aws.amazon.com/route53/).

Status of Precaution: ?

Feasibility until project end: ?
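
The error detection described in the DNS failover precaution above can be illustrated with a small Python sketch. It assumes one public IP per cloud provider and a plain TCP connect on port 443 as the health check; the provider names, the port, and the function names are assumptions. Updating the DNS entry itself depends on the DNS provider and is not shown.

```python
import socket
from typing import Callable, Dict, List


def tcp_reachable(ip: str, port: int = 443, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to ip:port succeeds within the timeout."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


def healthy_records(
    provider_ips: Dict[str, str],
    check: Callable[[str], bool] = tcp_reachable,
) -> List[str]:
    """Return the IPs that should stay in the DNS entry, ordered by provider name."""
    return [ip for _provider, ip in sorted(provider_ips.items()) if check(ip)]
```

A periodic job could compare the result with the current DNS entry and remove the IPs that are missing from it. Passing the check as a parameter keeps the filtering logic testable without network access.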

Network Failure in Local Network (Cloud Provider Inter-VM Connection)

Description: Individual VMs are reachable over the internet but cannot reach each other through a local connection.

Precaution: __

Status of Precaution: ?

Feasibility until project end: ?

VPN Gateway

Description: The VPN gateway through which the instances of one cloud provider can communicate with the instances of another cloud provider fails.

Precaution: Deploy a second redundant VPN gateway per cloud provider in order to enhance availability in case of failure (see here: https://github.com/amer/resinfra/wiki#gateway-as-the-single-point-of-failure).

Status of Precaution: Planned

Feasibility until project end: Feasible

Human Error: Delete .tfstate file

Description: The .tfstate file is accidentally deleted.

Precaution: Distribute the .tfstate file across the infrastructure and keep the individual copies in sync.

Status of Precaution: ?

Feasibility until project end: Feasible
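
Keeping the distributed copies of the .tfstate file in sync implies some way of detecting drift between them. The following Python sketch compares SHA-256 digests of the copies against a reference copy. The location names and helper functions are hypothetical, and how the copies are distributed is left open.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of a state file's content."""
    return hashlib.sha256(data).hexdigest()


def out_of_sync(copies: Dict[str, bytes], reference: str) -> List[str]:
    """Return the locations whose content differs from the reference copy."""
    expected = digest(copies[reference])
    return sorted(loc for loc, data in copies.items() if digest(data) != expected)


def load_copies(paths: Dict[str, Path]) -> Dict[str, bytes]:
    """Read every copy from disk, keyed by a free-form location name."""
    return {loc: path.read_bytes() for loc, path in paths.items()}
```

A check like this only reports divergence. Deciding which copy is authoritative still requires a rule, for example the copy written by the most recent successful `terraform apply`.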

Human Error: terraform destroy

Description: The terraform destroy command is accidentally executed.

Precaution: Use modules to limit the damage of an accidentally executed terraform destroy command. Use the prevent_destroy meta-argument to prevent the deletion of stateful resources like databases (https://www.terraform.io/docs/language/meta-arguments/lifecycle.html).

Status of Precaution: ?

Feasibility until project end: Feasible
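
In addition to the precaution above, a saved plan can be inspected before it is applied. The Python sketch below is not the prevent_destroy meta-argument itself but a complementary check. It assumes the plan was exported with `terraform show -json` and looks for `delete` actions on protected resource types. The type name `example_database` and the function name are hypothetical.

```python
import json
from typing import Iterable, List


def planned_deletions(plan_json: str, protected_types: Iterable[str]) -> List[str]:
    """Return addresses of protected resources that the plan would delete."""
    protected = set(protected_types)
    plan = json.loads(plan_json)
    hits = []
    for change in plan.get("resource_changes", []):
        actions = change.get("change", {}).get("actions", [])
        # A replacement shows up as both "delete" and "create", so it is
        # reported here as well.
        if "delete" in actions and change.get("type") in protected:
            hits.append(change["address"])
    return hits
```

A wrapper script could refuse to continue whenever the returned list is not empty.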

Security Breach

Description: An attacker gains access to one of the VMs of the system.

Precaution: Use Identity and Access Management best practices like least privilege to restrict access to our system components and prevent attackers from gaining wide-ranging access to the system. Access credentials like keys should be stored in hardware security modules (https://en.wikipedia.org/wiki/Hardware_security_module). Rotate keys and access credentials frequently to minimize the impact of potentially stolen credentials and keys. The Terraform files shouldn't contain any access credentials.

Status of Precaution: ?

Feasibility until project end: ?
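
The rule that Terraform files shouldn't contain access credentials can be checked mechanically, at least roughly. The following Python sketch flags assignments whose name looks like a credential and whose value is a literal string. The keyword list and the function name are assumptions, and a heuristic like this can miss secrets or report false positives.

```python
import re
from typing import List, Tuple

# Matches lines such as `access_key = "literal"`. Values that reference
# variables (unquoted, or starting with "$") are not matched.
SUSPICIOUS = re.compile(
    r'^\s*(\w*(?:password|secret|token|access_key)\w*)\s*=\s*"([^"$]+)"',
    re.IGNORECASE,
)


def find_hardcoded_credentials(tf_source: str) -> List[Tuple[int, str]]:
    """Return (line number, attribute name) for each suspicious assignment."""
    findings = []
    for number, line in enumerate(tf_source.splitlines(), start=1):
        match = SUSPICIOUS.match(line)
        if match:
            findings.append((number, match.group(1)))
    return findings
```

Such a check could run before each commit; it complements the credential storage and rotation measures described above rather than replacing them.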