
- Keeping a copy of the same data on multiple machines connected via a network
- You might want to replicate data:
- To keep data geographically close to your users (reduce latency)
- To allow the system to continue working even if some of its parts have failed (increase availability)
- To scale out the number of machines that can serve read requests (increase read throughput)
- Here, we assume that the dataset is small enough that each machine can hold a copy of the entire dataset.
- Three popular algorithms to replicate changes between nodes:
1. Single-leader
- Each node that stores a copy of the DB is called a replica
- With multiple replicas, how do we ensure that all the data is replicated to all the replicas?
- The process works as follows (known as leader-based, active/passive, or master/slave replication):
- One of the replicas is designated as the leader → all writes go to it
- The other replicas are followers → whenever the leader writes new data to its local storage, it also sends the data change to all of its followers as part of a replication log or change stream
- When a client wants to read from the DB, it can query the leader or any follower.
- Many databases use this mode of replication: PostgreSQL, MySQL, Oracle Data Guard, and SQL Server Always On, as well as MongoDB and Espresso. Message brokers such as Kafka and RabbitMQ also use it.
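- A minimal sketch of the leader/follower write path described above (hypothetical in-memory stand-ins, not any particular database):

```python
# Minimal sketch of leader-based replication: the leader applies each
# write locally, appends it to a replication log, and streams the
# change to every follower. All names here are hypothetical.

class Follower:
    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value

class Leader:
    def __init__(self, followers):
        self.data = {}
        self.log = []               # replication log / change stream
        self.followers = followers

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))
        for f in self.followers:    # send the data change to every follower
            f.apply(key, value)

followers = [Follower(), Follower()]
leader = Leader(followers)
leader.write("user:42", "alice")
assert all(f.data["user:42"] == "alice" for f in followers)
```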

- An important detail of a replicated system is whether replication happens synchronously or asynchronously
- Sync → the leader waits until the follower confirms it received the write before reporting success to the user (the follower is guaranteed to have an up-to-date copy of the data). Downside: if the synchronous follower doesn't respond, the write cannot be processed and the leader blocks.
- Often, leader-based replication is configured to be completely asynchronous → a write is not guaranteed to be durable: if the leader fails and is not recoverable, any writes that were not yet replicated to the followers are lost (see the sketch below)
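- A toy illustration of that durability gap (hypothetical names, plain dicts standing in for replicas):

```python
# Why async replication is not durable: the leader acknowledges the
# write before the follower has it, so a leader crash can lose the
# write. With sync replication the follower is up to date at ack time.

leader, follower = {}, {}
replication_queue = []                # changes waiting to be sent (async)

def write_async(key, value):
    leader[key] = value
    replication_queue.append((key, value))
    return "ok"                       # acknowledged before replication happens

def write_sync(key, value):
    leader[key] = value
    follower[key] = value             # leader waits for the follower's ack
    return "ok"                       # follower guaranteed up to date

write_async("balance", 100)
leader.clear(); replication_queue.clear()   # simulate leader crash before replication
print(follower.get("balance"))              # None: the acknowledged write is gone
```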

- Setting up new followers
- From time to time, we need to add new followers. How do we ensure that a new follower has an accurate copy of the leader's data?
- We cannot just copy the data files from one node to another → clients are constantly writing, so a plain file copy would see different parts of the DB at different points in time
- So, we do it like this:
- Take a snapshot of the leader DB at some point (sometimes we need to use 3rd party tools such as innobackupex for MySQL)
- Copy the snapshot to the new follower node
- The follower connects to the leader and requests all data changes since the snapshot → this requires the snapshot to be associated with an exact position in the replication log (e.g., PostgreSQL's log sequence number, MySQL's binlog coordinates)
- When the follower has processed the backlog of data changes since the snapshot, it has caught up → it can now continue to process data changes from the leader as they happen
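- A sketch of this bootstrap procedure (hypothetical structure; the "log position" is just an index into a list):

```python
# Follower bootstrap: snapshot at a known log position, copy the
# snapshot, then replay every log entry after that position.

leader_data = {"a": 1, "b": 2}
replication_log = [("a", 1), ("b", 2)]      # position i = i-th change

def take_snapshot():
    return dict(leader_data), len(replication_log)   # data + log position

snapshot, position = take_snapshot()

# writes keep arriving while the snapshot is being copied...
leader_data["c"] = 3
replication_log.append(("c", 3))

follower = dict(snapshot)                   # copy the snapshot to the new node
for key, value in replication_log[position:]:
    follower[key] = value                   # replay the backlog ("catch up")

assert follower == leader_data              # the new follower is consistent
```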
- Handling node outages (how to achieve high availability)
- Follower failure → on its local disk, each follower keeps a log of the data changes received from the leader. If a follower crashes and restarts, it knows the last transaction it processed, so it can request from the leader all the changes it missed and catch up (catch-up recovery).
- Leader failure → one of the followers needs to be promoted to the new leader, clients need to be reconfigured to send their writes to the new leader, and the other followers need to start consuming data changes from the new leader (this process is called failover)
- Failover can happen manually or automatically.
- Some things can go wrong with failover:
- If async replication is used, the new leader may not have received all the writes from the old leader → the most common solution is to simply discard the old leader's unreplicated writes
- Discarding writes is dangerous if other storage systems outside the DB need to be coordinated with the DB contents
- In certain fault scenarios, two nodes may both believe that they are the leader (called "split brain") → if both accept writes and there is no conflict resolution, data can be lost or corrupted (see the fencing sketch below)
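- One common mitigation for split brain is fencing: writes are tagged with a monotonically increasing election epoch, and stale leaders are rejected. This is a general technique, not something from these notes; a toy sketch:

```python
# Toy epoch-based fencing: every leader election increments an epoch,
# and the storage layer rejects writes carrying a stale epoch, so a
# "zombie" old leader cannot corrupt data after failover.

current_epoch = 1            # epoch of the most recent leader election
store = {}

def write(epoch, key, value):
    if epoch < current_epoch:
        raise RuntimeError(f"fenced: stale leader (epoch {epoch})")
    store[key] = value

write(1, "x", "old leader")      # accepted
current_epoch = 2                # failover: a new leader is elected
write(2, "x", "new leader")      # accepted
try:
    write(1, "x", "zombie")      # the old leader still thinks it leads
except RuntimeError as e:
    print(e)                     # fenced: stale leader (epoch 1)
```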
- Implementation of Replication Logs
- Statement-based replication → the leader logs every write request (statement) it executes and sends that statement log to its followers. For a relational DB, this means every INSERT, UPDATE, or DELETE statement.
- It can break for several reasons → non-deterministic functions such as NOW() or RAND() generate a different value on each replica; statements that use an autoincrementing column, or that depend on existing data, must be executed in exactly the same order on each replica
- It was used in MySQL before version 5.1 → MySQL now switches to row-based replication if there is any nondeterminism in a statement
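- A toy illustration of the nondeterminism problem (a Python stand-in for replaying a statement like `INSERT ... VALUES (RAND())`):

```python
# Replaying the *statement* re-evaluates RAND() on each replica,
# so the replicas end up with different data.

import random

def execute(replica, statement):
    if statement == "insert_random":        # stand-in for INSERT ... VALUES (RAND())
        replica.append(random.random())     # different value on each replica!

leader, follower = [], []
execute(leader, "insert_random")
execute(follower, "insert_random")          # follower replays the same statement
print(leader == follower)                   # almost certainly False: divergence
```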
- Write-ahead log (WAL) shipping → the log is an append-only sequence of bytes containing all writes to the DB. We can use the exact same log to build a replica on another node: besides writing the log to disk, the leader also sends it across the network to its followers.
- Used in PostgreSQL and Oracle
- The main disadvantage is that the log describes the data at a very low level: a WAL records which bytes were changed in which disk blocks, which couples replication to the storage engine → you typically cannot run different versions of the database software on the leader and the followers, which makes zero-downtime upgrades hard
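- A sketch of byte-level log shipping (the record format is invented for illustration): the log says "write these bytes at this offset", so the follower must use exactly the same on-disk layout as the leader:

```python
# WAL shipping at the byte level: the same append-only sequence of
# (offset, bytes) records is applied to the leader's disk and shipped
# to the follower's disk.

import io

leader_disk = io.BytesIO(bytearray(64))     # 64 zeroed bytes as a toy "disk"
follower_disk = io.BytesIO(bytearray(64))
wal = []                                    # append-only sequence of byte edits

def leader_write(offset, data):
    leader_disk.seek(offset); leader_disk.write(data)
    wal.append((offset, data))              # ship the same record to followers

def follower_apply():
    for offset, data in wal:
        follower_disk.seek(offset); follower_disk.write(data)

leader_write(0, b"\x07user42")
follower_apply()
assert leader_disk.getvalue() == follower_disk.getvalue()
```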
- Logical (row-based) log replication
- Logical log instead of physical data representation
- It is a sequence of records describing writes to DB tables at the granularity of a row:
- For an inserted row, the log contains the new values of all columns
- For a deleted row, the log contains enough information to uniquely identify the row that was deleted (typically the primary key)
- For an updated row, the log contains enough information to identify the updated row, plus the new values of all columns
- MySQL's binlog (when configured to use row-based replication) uses this approach
- My note: NO EXAMPLES OF LOGS!!
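- Since the note above flags the missing examples, here is a hypothetical sketch of what row-based log records could look like (the record format is invented for illustration):

```python
# Hypothetical logical (row-based) log covering the three cases above,
# plus a tiny apply loop a follower might run.

logical_log = [
    # insert: new values for all columns
    {"op": "insert", "table": "users",
     "row": {"id": 42, "name": "alice", "email": "a@example.com"}},
    # update: primary key identifies the row + new values of all columns
    {"op": "update", "table": "users", "key": {"id": 42},
     "row": {"id": 42, "name": "alice", "email": "alice@example.com"}},
    # delete: the primary key is enough to identify the row
    {"op": "delete", "table": "users", "key": {"id": 42}},
]

def apply_to_follower(table, log):
    for record in log:
        if record["op"] in ("insert", "update"):
            table[record["row"]["id"]] = record["row"]
        elif record["op"] == "delete":
            del table[record["key"]["id"]]

users = {}
apply_to_follower(users, logical_log)
print(users)   # {}: the row was inserted, updated, then deleted
```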
- Trigger-based replication
- Some tools, such as Oracle GoldenGate, can make data changes available to an application by reading the database log → an alternative, built into many relational DBs, is to use triggers and stored procedures
- A trigger lets you register custom application code that is executed automatically when a data change occurs in the DB → the trigger can log the change into a separate table, from which an external process can read it and replicate it elsewhere (more flexible, but with greater overhead and more room for bugs than built-in replication). See the sketch below.
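- A runnable sketch of the trigger idea using SQLite (just to show the mechanism; real trigger-based replication tools are more involved):

```python
# Trigger-based change capture: a trigger copies every insert into a
# "changes" table that an external replication process could read.

import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE changes (seq INTEGER PRIMARY KEY AUTOINCREMENT,
                      op TEXT, row_id INTEGER, name TEXT);
CREATE TRIGGER users_insert AFTER INSERT ON users
BEGIN
    INSERT INTO changes (op, row_id, name)
    VALUES ('insert', NEW.id, NEW.name);
END;
""")
db.execute("INSERT INTO users (id, name) VALUES (1, 'alice')")
print(db.execute("SELECT op, row_id, name FROM changes").fetchall())
# [('insert', 1, 'alice')]: ready to be shipped to another replica
```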
- Problems with replication lag
- Leader-based replication is attractive for workloads that are mostly reads: create many followers and distribute the read requests across them. Realistically, this read-scaling approach only works with async replication.
- The problem is replication lag → the delay between a write happening on the leader and being reflected on a follower
- Read your own writes → if a user writes and then reads from a stale follower, their own write seems to have been lost; we need read-after-write consistency (a user always sees the data they submitted themselves)
- Monotonic reads → when reading from different async followers, a user can see things moving backward in time. To prevent this, make sure that each user always reads from the same replica (see the sketch after this list)
- Consistent prefix reads → if a sequence of writes happens in a certain order, then anyone reading those writes will see them appear in the same order
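- A sketch of the "same user, same replica" routing used for monotonic reads (replica names are made up): hash the user ID to pick a replica deterministically:

```python
# Monotonic reads via deterministic replica choice: the same user is
# always routed to the same follower, so their reads never jump to a
# more-lagged replica and appear to go backward in time.

import hashlib

replicas = ["follower-0", "follower-1", "follower-2"]

def replica_for(user_id):
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return replicas[int(digest, 16) % len(replicas)]

assert replica_for("user:42") == replica_for("user:42")   # stable routing
print(replica_for("user:42"))
```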
2. Multi-leader
- Single-leader replication has a problem: if you can't reach the leader (e.g., due to a network interruption), you can't write to the DB
- Allow more than one node to accept writes; replication still happens in the same way, with each node that processes a write forwarding the data change to all the other nodes → multi-leader replication

- Use cases for multi-leader replication
- If the DB is replicated across several datacenters, we can have a leader in each datacenter
- Compared with single-leader, a multi-leader setup can perform better (every write is processed in its local datacenter), tolerate datacenter outages (each datacenter keeps operating independently), and tolerate network problems between datacenters (a temporary interruption doesn't block writes)
- Multi-leader has a significant downside: the same data may be concurrently modified in two different datacenters, and those write conflicts must be resolved
- It is also convenient for apps that need to keep working while disconnected from the internet (such as calendar apps on your phone and laptop): every device has a local DB that acts as a leader, and there is a multi-leader replication process between the replicas
- Also, real-time collaborative editing apps such as Google Docs
- Handling Write Conflicts → biggest problem
- Example: two users simultaneously edit the same wiki page → each edit succeeds on its local leader, and the conflict is only detected later, when the writes are replicated
- Sync vs. async conflict detection → in multi-leader replication, conflicts are typically detected asynchronously, after both writes have already been accepted
- Conflict avoidance → a simple strategy: ensure that all writes to a particular record go through the same leader; this way, conflicts cannot occur
- Converging toward a consistent state → give each write a unique ID (e.g., a timestamp or UUID) and let the write with the highest ID win (last write wins, LWW; popular but prone to data loss), or let writes that originated at a higher-numbered replica always take precedence over writes that originated at a lower-numbered replica
- Custom conflict resolution logic → handled within the app: the database calls the custom code when it detects a conflict in the log of replicated changes, either on write or on read
- Automatic conflict resolution → example: an Amazon shopping cart that preserved items added to the cart but not items removed, so removed items reappeared. Read more about conflict-free replicated data types (CRDTs), mergeable persistent data structures, and operational transformation. A toy LWW merge is sketched below.
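- A toy LWW merge for the wiki-page conflict above (timestamps and replica names are invented). Note how the losing concurrent write is silently dropped, which is why LWW is prone to data loss:

```python
# Last-write-wins (LWW): each write carries a (timestamp, replica_id)
# tag, and the write with the highest tag wins the merge.

def lww_merge(write_a, write_b):
    # each write: (value, (timestamp, replica_id))
    return max(write_a, write_b, key=lambda w: w[1])

a = ("page title: B", (1700000001, "leader-1"))   # edit at datacenter 1
b = ("page title: C", (1700000002, "leader-2"))   # concurrent edit at datacenter 2
print(lww_merge(a, b))   # ('page title: C', ...) and leader-1's edit is lost
```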
- Multi-Leader Replication Topologies
- There could be multiple topologies with multiple leaders: circular, star, and all-to-all.
- All-to-all is the most general topology, in which every leader sends its writes to every other leader. A drawback: some network links may be faster than others, so some replication messages may overtake others (causing ordering problems).
- MySQL by default supports only a circular topology, in which each node receives writes from one node and forwards those writes (plus its own) to one other node.
- In circular and star topologies, a write may need to pass through several nodes before it reaches all replicas, so nodes must forward data changes they receive from other nodes. To prevent infinite replication loops, each node is given a unique identifier, and in the replication log each write is tagged with the identifiers of all the nodes it has passed through (see the sketch below).
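- A sketch of loop-free forwarding around a ring (node names are made up): forwarding stops once the write reaches a node that has already seen it:

```python
# Circular-topology forwarding: the write carries the IDs of the nodes
# it has passed through, so the ring stops forwarding after one lap.

nodes = ["n0", "n1", "n2", "n3"]

def forward(write, seen, node):
    if node in seen:                  # already applied here: break the loop
        return
    seen.append(node)                 # apply the write, tag it with this node's ID
    forward(write, seen, nodes[(nodes.index(node) + 1) % len(nodes)])

seen = []
forward({"key": "x"}, seen, "n2")     # the write originates at node n2
print(seen)                           # ['n2', 'n3', 'n0', 'n1'], then it stops
```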