
This chapter discusses various algorithms and protocols for constructing fault-tolerant distributed systems. The best way to build fault-tolerant systems is to find general-purpose abstractions with useful guarantees, implement them once, and let applications rely on those guarantees.
- Consistency guarantees
- Two database replicas typically don't hold exactly the same data at a given moment, but most replicated databases provide at least eventual consistency: if you stop writing, the replicas eventually converge to the same value. A better name for it is convergence, because the guarantee says nothing about when the replicas will converge.
- Linearizability → the strongest consistency model in common use
- In an eventually consistent DB, you may get two different answers if you ask two replicas the same question at the same time.
- The idea behind linearizability is to make a system appear as if there were only one copy of the data, so every operation on it looks atomic.
- As soon as one client completes a write, all clients reading from the DB must be able to see the value that was just written.
- We can test whether a system is linearizable by recording the timing of all requests and responses and checking whether they can be arranged into a valid sequential order (see the sketch below).
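- A minimal sketch of that check, assuming a single register and a tiny recorded history (the `Op` record and `is_linearizable` helper are invented for illustration): it brute-forces every ordering of the history and accepts it if some ordering respects both real-time precedence and register semantics.

```python
from itertools import permutations
from dataclasses import dataclass

@dataclass
class Op:
    kind: str      # "write" or "read"
    value: object  # value written, or value returned by the read
    start: float   # invocation timestamp
    end: float     # response timestamp

def is_linearizable(history, initial=None):
    """Brute-force check: does some total order of the recorded operations
    respect real-time order and single-register semantics?"""
    for order in permutations(history):
        # Real-time constraint: if b finished before a started,
        # b must not be placed after a.
        ok = all(not (b.end < a.start)
                 for i, a in enumerate(order)
                 for b in order[i + 1:])
        if not ok:
            continue
        # Register semantics: each read must see the latest preceding write.
        current = initial
        valid = True
        for op in order:
            if op.kind == "write":
                current = op.value
            elif op.value != current:
                valid = False
                break
        if valid:
            return True
    return False

# A tiny recorded history: the read overlaps the write, so either answer is fine.
history = [Op("write", 1, start=0.0, end=2.0),
           Op("read", 1, start=1.0, end=3.0)]
print(is_linearizable(history, initial=0))  # True
```

- Note: this brute force is exponential in the history size; real checkers such as Jepsen's Knossos prune the search far more aggressively.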
- Where linearizability is useful
- A single-leader replication system must ensure that there is only one leader, not several. One way is to use a lock: every node that starts up tries to acquire the lock, and the one that succeeds becomes the leader. This lock must be linearizable. Apache ZooKeeper is often used to implement distributed locks and leader election (a client sketch follows below).
- Uniqueness constraints in databases (such as a unique username). We need linearizability if we want to enforce such constraints at the moment the data is written.
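- A sketch of the lock-based leader election above, assuming the kazoo Python client for ZooKeeper (the ensemble address, lock path, and node identifier are placeholders):

```python
from kazoo.client import KazooClient

# Connect to a ZooKeeper ensemble (address is a placeholder for this example).
zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Every node that starts up tries to acquire the same lock;
# the node holding it acts as the leader.
lock = zk.Lock("/app/leader-lock", identifier="node-1")

with lock:  # blocks until the lock is acquired
    # While we hold the lock, we are the single leader.
    print("I am the leader now")
    # ... do leader-only work here ...

zk.stop()
```

- Whichever node acquires `/app/leader-lock` acts as the leader; the others block inside `with lock:` and one of them takes over when the lock is released or the holder's session expires.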
- Implementing Linearizable Systems
- The most common approach to make a system fault-tolerant is to use replication.
- Single-leader replication (potentially linearizable)
- Consensus algorithms (linearizable)
- Multi-leader replication (not linearizable)
- Leaderless replication (probably not linearizable)
- Important note: the CAP theorem as commonly stated is unhelpful. It is misleading because a network partition is a kind of fault, not something you can choose to avoid; it will happen sooner or later. A better way to phrase the theorem is "either consistent or available when partitioned".
- Ordering guarantees
- Ordering is an essential topic in the book because it helps preserve causality
- Causality imposes an order on events.
- If the system obeys the ordering imposed by causality, it is causally consistent; e.g., snapshot isolation provides causal consistency.
- Linearizability is stronger than causal consistency. Linearizability implies causality.
- Causal consistency is the strongest possible consistency model that does not slow down due to network delays and remains available in the face of network failures.
- Sequence number ordering. Causality is an important theoretical concept, but explicitly tracking all causal dependencies can be impractical; using sequence numbers or timestamps to order events is a more practical approach.
- If there is no single leader, it is less clear how to generate the sequence numbers for operations. We can use different methods:
- Each node can generate its own set of sequence numbers. For example, one node generates odd numbers and another generates even numbers.
- You can attach a timestamp from a time-of-day clock (physical clock) to each operation.
- You can preallocate blocks of sequence numbers. For example, node A might claim the block from 1 to 1,000 and node B the block from 1,001 to 2,000.
- But there is a problem: the sequence numbers these generators produce are not consistent with causality, because they fail to capture the ordering of operations across different nodes (see the sketch below).
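- A toy illustration of the problem, using the odd/even scheme above (the "nodes" are just two Python counters; no networking involved):

```python
import itertools

# Node A hands out odd sequence numbers, node B hands out even ones.
node_a = itertools.count(start=1, step=2)   # 1, 3, 5, ...
node_b = itertools.count(start=2, step=2)   # 2, 4, 6, ...

# Node B has already processed a few operations, so its counter is ahead.
for _ in range(3):
    next(node_b)                            # 2, 4, 6 consumed

cause = next(node_b)    # a write on node B gets sequence number 8
effect = next(node_a)   # a causally *later* write on node A gets 1
print(cause, effect)    # 8 1 -> the numbering contradicts causality
```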
- There is a simple method for generating sequence numbers that is consistent with causality, known as Lamport timestamps. Each node has a unique ID and keeps a counter of the number of operations it has processed; the timestamp is the pair (counter, node ID). This provides a total ordering without being bound to physical clocks: the timestamp with the greater counter value is the greater timestamp, and if the counter values are equal, the one with the greater node ID is the greater timestamp (sketch below).
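- A minimal sketch of a Lamport clock; the class and method names are invented for this example:

```python
class LamportClock:
    def __init__(self, node_id):
        self.node_id = node_id
        self.counter = 0          # number of operations processed so far

    def tick(self):
        """Local operation or send: increment and return (counter, node_id)."""
        self.counter += 1
        return (self.counter, self.node_id)

    def observe(self, received_counter):
        """On seeing a timestamp from elsewhere, fast-forward to the maximum."""
        self.counter = max(self.counter, received_counter)

# Total order: compare counters first, then break ties with the node ID.
def less_than(ts_a, ts_b):
    return ts_a < ts_b            # tuple comparison: (counter, node_id)

a, b = LamportClock("A"), LamportClock("B")
t1 = b.tick(); t2 = b.tick()      # B issues (1, 'B'), then (2, 'B')
a.observe(t2[0])                  # A learns B's counter via a message
t3 = a.tick()                     # A issues (3, 'A'), ordered after t2
print(less_than(t2, t3))          # True: consistent with causality
```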
- Total order broadcast. On a single CPU core it is easy to define a total ordering of operations, but in distributed systems getting all nodes to agree on the same total ordering of operations is tricky.
- Single-leader replication determines a total order of operations by choosing one leader and sequencing all operations on a single CPU core on the leader. The challenge is how to scale such a system beyond a single leader, and how to handle failover when the leader fails. This problem is known as total order broadcast or atomic broadcast (a toy sketch follows below).
- Consensus services such as ZooKeeper implement total order broadcast.
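- A toy sketch of the leader-based idea (not ZooKeeper's actual Zab protocol): one sequencer node stamps every message with a monotonically increasing number, and each replica delivers messages strictly in sequence order, buffering anything that arrives early.

```python
import heapq

class Sequencer:
    """The leader: assigns a global sequence number to every message."""
    def __init__(self):
        self.next_seq = 0

    def broadcast(self, msg, replicas):
        seq = self.next_seq
        self.next_seq += 1
        for r in replicas:
            r.receive(seq, msg)   # may arrive out of order in a real network

class Replica:
    """Buffers out-of-order messages and delivers them in sequence order."""
    def __init__(self, name):
        self.name = name
        self.expected = 0
        self.buffer = []          # min-heap of (seq, msg)

    def receive(self, seq, msg):
        heapq.heappush(self.buffer, (seq, msg))
        while self.buffer and self.buffer[0][0] == self.expected:
            _, m = heapq.heappop(self.buffer)
            print(f"{self.name} delivers: {m}")
            self.expected += 1

replicas = [Replica("r1"), Replica("r2")]
leader = Sequencer()
leader.broadcast("set x = 1", replicas)
leader.broadcast("set x = 2", replicas)   # every replica delivers in the same order
```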
- Distributed Transactions
- IMPORTANT NOTE: Consensus is one of the most critical problems in distributed computing
- There are many situations in which it is essential for nodes to agree:
- In single-leader replication, all nodes need to agree on which node is the leader. If some nodes cannot communicate with others due to a network fault, consensus is crucial to prevent a bad failover (split brain), in which two nodes both believe they are the leader.
- Atomic commit - when we have transactions over several nodes, the transaction may fail on some nodes. If we want to achieve atomicity, we must get all nodes to agree on the outcome of the transaction (either all abort or all commit) - this is known as the atomic commit problem.
- Two-phase commit (2PC) algorithm.
- For transactions on a single DB node, atomicity is commonly implemented by the storage engine. When the database commits the transaction, it makes the transaction's writes durable (typically in a write-ahead log) and then appends a commit record to the log on disk. If the DB crashes in the middle of this process, the transaction is recovered from the log when the node restarts.
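- A simplified sketch of that write-ahead-log idea (the file name and JSON record format are assumptions, not any real engine's layout): writes hit the log first, a transaction counts as committed only once its commit record is fsynced, and recovery replays exactly the committed transactions.

```python
import json, os

LOG_PATH = "wal.log"   # assumed file name for this sketch

def log_write(txid, key, value):
    """Append the data change to the write-ahead log and flush it to disk."""
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps({"txid": txid, "op": "write", "key": key, "value": value}) + "\n")
        f.flush(); os.fsync(f.fileno())

def log_commit(txid):
    """The transaction is committed the moment this record hits the disk."""
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps({"txid": txid, "op": "commit"}) + "\n")
        f.flush(); os.fsync(f.fileno())

def recover():
    """After a crash, replay only transactions whose commit record made it to disk."""
    writes, committed, state = {}, set(), {}
    if not os.path.exists(LOG_PATH):
        return state
    with open(LOG_PATH) as f:
        for line in f:
            rec = json.loads(line)
            if rec["op"] == "write":
                writes.setdefault(rec["txid"], []).append((rec["key"], rec["value"]))
            else:
                committed.add(rec["txid"])
    for txid in committed:
        for key, value in writes.get(txid, []):
            state[key] = value
    return state
```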
- But when multiple nodes are involved in a transaction, sending a commit request to all nodes and independently committing the transaction on each node is not enough, because the commit may succeed on some nodes and fail on others.
- The two-phase commit (2PC) algorithm can achieve atomic transaction commit across multiple nodes.
- With 2PC, instead of a single commit request, as with a single-node transaction, the commit/abort process is split into two phases.
- It uses a coordinator (transaction manager), often implemented as a library within the same application process that is requesting the transaction. Examples are Narayana, JOTM, BTM, or MSDTC (in the Microsoft world).
- It begins with the application reading and writing data on multiple DB nodes (the participants). When the application is ready to commit, phase 1 begins: the coordinator sends a prepare request to each node, asking whether it is able to commit, and tracks the responses. If all participants reply "yes", the coordinator sends a commit request in phase 2; if any participant replies "no", it sends an abort request to all nodes.
- If any of the prepare requests fail or time out, the coordinator aborts the transaction. If a commit or abort request fails or times out, the coordinator retries it indefinitely.
- If the coordinator fails before sending the prepare request, a participant can safely abort the transaction. But once a participant has received the prepare request and voted "yes", it must wait to hear the decision from the coordinator; if the coordinator crashes or the network fails at that point, the participant can do nothing but wait.
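- A minimal, in-process sketch of the two phases (the `Participant` interface and coordinator function are invented for illustration; a real implementation would write a durable coordinator log and talk to participants over the network):

```python
class Participant:
    """A database node taking part in the distributed transaction."""
    def __init__(self, name):
        self.name = name
        self.prepared = False

    def prepare(self, txid):
        # Phase 1: durably stage the transaction, then promise to commit if asked.
        self.prepared = True
        return True               # "yes" vote; a real node may vote "no"

    def commit(self, txid):
        print(f"{self.name}: committed {txid}")

    def abort(self, txid):
        print(f"{self.name}: aborted {txid}")

def two_phase_commit(txid, participants):
    # Phase 1: ask every participant whether it is able to commit.
    try:
        votes = [p.prepare(txid) for p in participants]
    except Exception:
        votes = [False]           # a failed or timed-out prepare counts as "no"

    # Phase 2: commit only if *all* votes were "yes"; otherwise abort everywhere.
    decision = all(votes)
    for p in participants:
        if decision:
            p.commit(txid)        # in reality, retried indefinitely until it succeeds
        else:
            p.abort(txid)
    return decision

two_phase_commit("tx-42", [Participant("db1"), Participant("db2")])
```

- The key property: once every participant has voted "yes", the decision is commit, and the coordinator keeps retrying the phase-2 requests until they go through.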
- Distributed Transactions in Practice
- Distributed transactions implemented with 2PC have a mixed reputation: they provide an important safety guarantee, but they cause operational problems and kill performance.
- Distributed transactions in MySQL are reported to be over 10× slower than single-node transactions.
- There are two types of distributed transactions that are often conflated:
- Database-internal distributed transactions, where all nodes run the same database software and do not have to be compatible with any other system.
- Heterogeneous distributed transactions spanning different technologies, e.g., databases from different vendors, or a database and a message broker.
- X/Open XA is a standard for implementing two-phase commit across heterogeneous technologies. It was introduced in 1991 and implemented by many traditional DBs, like PostgreSQL, MySQL, SQL Server, and Oracle, and message brokers such as ActiveMQ or MSMQ. It is a C API for a transaction coordinator, not a network protocol.
- A problem with 2PC is that locks are held while a transaction is in doubt: if the coordinator crashes while a participant holds row locks (as with serializable isolation), those locks may be held for a long time. Recovering from coordinator failure may require an administrator to resolve in-doubt transactions manually, or participants to make heuristic decisions (unilaterally committing or aborting, which can break atomicity).
- Fault-tolerant consensus. The best-known algorithms are Viewstamped Replication (VSR), Paxos, Raft, and Zab.
- Membership and coordination services.
- ZooKeeper is designed to hold a small amount of data that fits entirely in memory. This data is replicated across all nodes using a fault-tolerant total order broadcast algorithm.
- Where ZooKeeper excels is when you have several instances of a process and one of them needs to be designated as the leader; if the leader fails, one of the other nodes should take over (see the election sketch below).
- Also, it is useful when you have partitioned resources (DB, stream, file storage, …), and when a new node joins the cluster, some partitions need to be moved from existing nodes to new nodes in order to rebalance the load (Rebalancing Partitions).
- ZooKeeper and Consul are also used for service discovery.
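- A leader-election sketch for the use case above, assuming the kazoo client (hosts, election path, and identifier are placeholders): `Election.run()` blocks until this process wins the election and then calls the supplied function; when the leader dies, one of the waiting contenders is elected.

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")    # placeholder ensemble address
zk.start()

def lead():
    # Runs only while this process is the elected leader.
    print("Elected leader: starting leader-only work")

# All instances contend on the same election path; ZooKeeper picks one winner.
election = zk.Election("/app/election", identifier="node-1")
election.run(lead)                          # blocks until elected, then calls lead()

zk.stop()
```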
My note: In distributed systems, everything is unpredictable, from clocks and networks to the consistency of data across multiple database nodes.