viewstamped replication
Recently I've read a paper on viewstamp replication (VSR), a state-machine replication protocol. State-machine protocols work by ensuring all replicas perform the same deterministic operations, ensuring every replica reaches the same end-state. VSR consists of three main steps, and is simple to understand:
- Normal Operation, handles client requests and ensures consistency.
- View Change, elects a new primary when the current one fails.
- Recovery, brings a failed node back into the system.
Normal Operation
A client will send a request to the primary to perform some operation X.
The primary will first write X to it's logs and broadcast a prepare
message to the cluster signaling it's got a request and wants to perform the operation X.
Upon receiving the message, replicas will then write X to their logs, and send an prepare-ok
message back to the primary acting as an acknowledgment.
Once the primary gets the majority of the node's agreement, operation X is then considered committed.
The primary finishes by performing the operation X.
Replicas won't perform the operation till it receives the next message from the primary,
which will either be a new request for operation Y in which case has to perform the operation X before proceeding, or a commit message for X (which is kinda a heartbeat from the primary to say it's still healthy and no new requests have came in).
View Change
This is when the primary fails, the system must detect that and assign a replica to be the new primary.
Replicas will suspect the primary failed if they don't received a prepare
or commit
message before they time out.
Noticing replicas begins by changing it's status to view-change (stop processing messages); increment it's view number (to start a new view); and send out a start-view-change
message with the new view, to the cluster.
Other replicas can either disagree with the new view (if it's own view is greater than the proposed new view); or agree (it's current view is less than the proposed new view) and continue processing as a noticing replica.
Lots of message will be sent and replicas that gets the majority vote on the correct new view will send out a do-view-change
message to the a new primary.
The new primary will collect the logs sent to it and chooses the log based on the most recent view, breaking ties with the one with the most recent operation.
After it'll change it's status back to normal (to begin processing requests) and send a start-view
message to the cluster.
Subsequently all replicas receiving the start-view
will change their status back to normal.
Recovery
Recovering replicas will set it's status to recovery, preventing it from participating in the normal and view change sub-protocols.
It will then send a recovery
message the cluster and wait to acknowledgments back from replicas.
Normal operating replicas will send an recovery-response
message, with the primary sending over it's up-to-date logs.
The recovering replica will perform all the operations to bring itself up-to-date and set it's status back to normal.