This repository contains a set of Terraform scripts that automatically installs a set of minimum 3 Polkadot nodes to the chosen cloud. Currently, scripts support only AWS provider, GCP and Azure scripts are in active development phase.
The overall idea is to create a similar infrastructures in three separate regions - primary, secondary and tertiary. One of the nodes, created at these regions gets elected as a validator. The other two resume operations in archive mode, which allows to switch them to the validator fast once the current validator goes down. This is called "Warm standby backup".
The Polkadot node and the Consul server are both installed during instance startup phase. If an instance gets elected as a validator, Consul immediately transforms the regular Polkadot node into a validator. When the lock is lost - the node simply stops. Cloud will notice that the node is down and destroy it replacing it via the autoscaling mechanisms. The new node will be created as a regular one until it gets elected as a validator.
Note that some regions in clouds might not support all the required features. As the result these scripts will fail on this regions. Scripts were tested and approved to work on default regions only. Use all the other regions with extreme care.
As for the Leader election mechanism the script reuses the existing Leader election solution implemented by Hashicorp team in their Consul solution. The very minimum of 3 nodes is required to start the failover scripts. This requirement comes from the Raft algorithm that is used to reach consensus on the current validator.
The algorithm works by electing one leader accross all nodes joined to the cluster. When one of the instances goes down the rest can still reach consensus via a majority quorum. If the majority quorum is reached (2 out of 3 nodes, 3 out of 5 nodes, etc.), the new validator gets elected. Non-selected nodes continue to operate in non-validator mode. An even number of instances could cause the so-called "split-brain" to occur in case exactly half of the nodes go offline. No leader will be elected at all because no majority quorum can be reached with 2 out of 4 (3 ot out of 6, etc.) instances (51% of votes not reached).
Even if the entire region of a cloud provider goes down, this solution will ensure that the Polkadot node is still up given that two other regions are still up and thus the Consul cluster can reach the quorum majority.
This project contains 7 folders.
A folder that contains a CircleCI configuration. CI system verifies incoming PRs using a set of predefined tests.
This folder contains Terraform configuration scripts that will deploy the failover solution to AWS Cloud. Use terraform.tfvars.example file to see the very minimum configuration required to deploy solution. See README inside of AWS folder for more details.
These scripts will deploy the following architecture components:
Important! Unlike with other clouds Azure require you to clone the repository with --recurse-submodules
argument, so the whole command would be like git clone --recurse-submodules https://github.com/protofire/polkadot-failover-mechanism
. This is required, because Terraform Azure scripts relies on existing load balancer module in another repository.
This folder contains Terraform configuration scripts that will deploy the failover solution to Azure cloud. Use terraform.tfvars.example file to see the very minimum configuration required to deploy solution. See README inside of Azure folder for more details.
This folder contains Terraform configuration scripts that will deploy the failover solution to Google Cloud Platform. Use terraform.tfvars.example file to see the very minimum configuration required to deploy solution. See README inside of GCP folder for more details.
This folder contains the Dockerfile for the Docker image that is published on DockerHub.
This folder contains a set of tests to be run through CI mechanism. These tests can be launched manually. Simply go to the tests folder, and set up environment as described README.md
This folder contains a set of configuration and shell script files that are required for solution to work. VMs downloads these files during startup.
We are Protofire, a blockchain unit at Altoros. This repository was created to support and automate this project and is a result of the grant issued by the Web3 Foundation. In our work we mainly used Terraform to build the automatic IaC solution.
Feel free to contribute by opening issues and PR at this repository. There are no specific contribution rules.
This project is licensed under the Apache Public License v2.0. See the LICENSE file for details.