
- The Slippery Concept of a Transaction
- ACID → In practice, it can mean many different things; one database's ACID is not the same as another's.
- Atomicity → Something that cannot be broken into smaller parts. It is not about concurrency. It describes what happens if a client wants to make several writes, but a fault occurs after some of them. If the transaction cannot complete (commit), it is aborted, and the database must discard or undo any writes it has made so far (see the sketch after this list).
- Consistency → You have certain statements (invariants) about your data that must always be true. This is a property of the application, not the database.
- Isolation → Concurrently executing transactions are isolated; they cannot interfere with each other. Classic textbooks formalize isolation as serializability, meaning each transaction can pretend to be the only transaction running on the DB. The DB ensures that once transactions have committed, the result is the same as if they had run serially. In practice, today's DBs mostly use snapshot isolation, which is weaker than serializability.
- Durability → Once a transaction has committed, any data it has written will not be forgotten. In replicated DBs, durability may mean the data has been successfully copied to several nodes.
- Single-Object and Multi-Object Operations. Multi-object transactions require determining which read and write operations belong to the same transaction. This grouping is typically based on the client's TCP connection to the DB server: everything between BEGIN TRANSACTION and COMMIT on that connection is considered part of the same transaction.
- Single-object writes. Storage engines aim to provide atomicity and isolation at the level of a single object (e.g., a key-value pair) on one node. Atomicity can be implemented using a log for crash recovery, and isolation using a lock on each object. E.g., when writing a 20 KB JSON document, we don't want to store only half of it if an interruption happens.
- The need for multi-object transactions. Many distributed DBs have abandoned multi-object transactions because they are difficult to implement across partitions. Also, many frameworks, such as Rails ActiveRecord or Django, don't retry aborted transactions; they just bubble the error up.
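
A minimal sketch of the atomicity guarantee in PostgreSQL-flavored SQL; the `accounts` table and amounts are made up for illustration:

```sql
-- Hypothetical accounts table, just for illustration.
CREATE TABLE accounts (id INT PRIMARY KEY, balance NUMERIC NOT NULL);
INSERT INTO accounts VALUES (1, 100), (2, 100);

BEGIN;  -- start a multi-object transaction
UPDATE accounts SET balance = balance - 50 WHERE id = 1;
UPDATE accounts SET balance = balance + 50 WHERE id = 2;
-- If a fault occurs before COMMIT (crash, constraint violation, lost
-- connection), the DB discards both writes: all or nothing.
COMMIT; -- both writes take effect together
```
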
- **Weak Isolation Levels** (non-serializable)
- Read committed
- The most basic isolation level; it makes two guarantees:
- When reading from the DB, you will only see data that has been committed (no dirty reads). A transaction's writes only become visible to others when it commits. A dirty read means another transaction sees some of those updates but not others.
- When writing to the DB, you will only overwrite data that has been committed (no dirty writes). When two transactions concurrently update the same data, we usually assume the later write overwrites the earlier one; a dirty write happens when the later write overwrites a value that has not yet been committed.
- It is very popular and the default in Oracle, PostgreSQL, SQL Server, etc.
- Most commonly, DBs prevent dirty writes by using row-level locks: a transaction must hold the lock on an object while modifying it, until it commits or aborts.

- Dirty reads are prevented as follows → the DB remembers both the old committed value and the new value set by the transaction that holds the write lock. While that transaction is ongoing, any other transaction that reads the object is given the old value (see the sketch below).
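
A sketch of both read-committed guarantees in action, reusing the hypothetical `accounts` table from above:

```sql
-- Session A:
BEGIN;
UPDATE accounts SET balance = 0 WHERE id = 1;   -- A takes a row-level lock on id = 1

-- Session B, while A is still open:
SELECT balance FROM accounts WHERE id = 1;      -- returns the old committed value (no dirty read)
UPDATE accounts SET balance = 99 WHERE id = 1;  -- blocks on A's row lock (no dirty write)

-- Session A:
COMMIT;  -- B's UPDATE now unblocks and proceeds
```
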
- Snapshot isolation
- Even with read committed, concurrency bugs can still occur (Figure 7-6, page 237).
- Read skew → a non-repeatable read: the transaction sees different parts of the DB as of different points in time. It particularly hurts long-running operations such as backups, analytic queries, and integrity checks (see the sketch below).
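
A sketch of the anomaly, assuming the hypothetical `accounts` table with 100 in each account:

```sql
-- Reader (e.g., a backup or analytic query):
BEGIN;
SELECT balance FROM accounts WHERE id = 1;  -- sees 100

-- A transfer of 50 from account 2 to account 1 runs to completion here:
BEGIN;
UPDATE accounts SET balance = balance - 50 WHERE id = 2;
UPDATE accounts SET balance = balance + 50 WHERE id = 1;
COMMIT;

-- Reader, same transaction as before:
SELECT balance FROM accounts WHERE id = 2;  -- sees 50, so the total looks like 150
COMMIT;
```

Under snapshot isolation (in PostgreSQL, `BEGIN ISOLATION LEVEL REPEATABLE READ;`), both SELECTs would read from the same snapshot and see 100 + 100.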

- Snapshot isolation is the most common solution to this problem → each transaction reads from a consistent snapshot of the database, i.e., all the data that was committed as of the start of the transaction.
- For writes, it is usually implemented with write locks that prevent dirty writes, as in read committed.
- Key principle → readers never block writers, and writers never block readers.
- To implement it, DBs use a generalization of the mechanism for preventing dirty reads: keeping several different committed versions of an object. → This technique is called multi-version concurrency control (MVCC). Figure 7-7 shows its implementation in PostgreSQL: each row is tagged with the IDs of the transactions that created and deleted it, and visibility rules determine which versions each transaction can see.
- Other problems can still occur, like the lost update problem (two concurrent read-modify-write cycles, where one clobbers the other's write). It can be addressed with atomic write operations (which remove the need for read-modify-write cycles in app code), explicit locking (SELECT ... FOR UPDATE), automatic detection of lost updates, or compare-and-set (see the sketches below).
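
Sketches of three of these remedies in PostgreSQL-flavored SQL; the `counters` table (columns `key`, `value`) is hypothetical:

```sql
-- 1. Atomic write: the read-modify-write cycle happens inside the DB.
UPDATE counters SET value = value + 1 WHERE key = 'foo';

-- 2. Explicit locking: FOR UPDATE locks the row until the transaction ends.
BEGIN;
SELECT value FROM counters WHERE key = 'foo' FOR UPDATE;
-- ... the application computes the new value (say 43) ...
UPDATE counters SET value = 43 WHERE key = 'foo';
COMMIT;

-- 3. Compare-and-set: the write succeeds only if the value hasn't changed.
UPDATE counters SET value = 43 WHERE key = 'foo' AND value = 42;
-- If 0 rows were affected, retry the whole read-modify-write cycle.
```
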

- Another problem that can happen is write skew and phantoms (see the sketch below).
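
A sketch of write skew, modeled on the book's on-call doctors example; the `doctors` table (columns `name`, `on_call`) is hypothetical:

```sql
-- Rule: at least one doctor must stay on call. Alice and Bob both are.
-- Each runs this concurrently under snapshot isolation:
BEGIN;
SELECT count(*) FROM doctors WHERE on_call;  -- both see 2, so both proceed
UPDATE doctors SET on_call = false
  WHERE name = 'Alice';                      -- Bob's transaction updates 'Bob'
COMMIT;
```

Each transaction updates a different row, so there is no write conflict to detect, yet the combined result (zero doctors on call) breaks the invariant.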
- Serializability
- Read committed and snapshot isolation do not protect against all race conditions, e.g., write skew.
- The strongest isolation level.
- It guarantees that even though transactions may execute in parallel, the result is the same as if they had executed one at a time, serially, without any concurrency.
- DBs implement this in three ways:
- Actual serial execution → run transactions one at a time, on a single thread. This requires every transaction to be small and fast, and it is limited to cases where the active dataset fits in memory.
- Two-phase locking (2PL). Readers block writers, and writers block readers. MySQL (InnoDB) and SQL Server use it for their serializable level. Locks are taken on every object a transaction touches and held until it finishes. The major downside is performance, as concurrency is drastically reduced.
- Predicate locks. A lock that applies to all objects matching a search condition, even objects that don't exist yet (which prevents phantoms). Checking for matching locks is time-consuming.
- Index-range locks. A simplified approximation of predicate locks; most 2PL databases use them in practice.
- Serializable Snapshot Isolation (SSI). The algorithm provides full serializability with only a small performance penalty. First described in 2008, in Michael Cahill's PhD thesis. PostgreSQL has used it since version 9.1.
- Optimistic concurrency control on top of snapshot isolation: transactions don't block each other; instead, the DB aborts a transaction at commit time if serializability has been violated.
- It is based on snapshot isolation: all reads within a transaction are made from a consistent snapshot of the DB (see the sketch below).
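
A sketch of opting into SSI in PostgreSQL 9.1+, reusing the hypothetical `doctors` table:

```sql
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT count(*) FROM doctors WHERE on_call;
UPDATE doctors SET on_call = false WHERE name = 'Alice';
COMMIT;
-- Under SSI, one of the two conflicting on-call transactions would fail at
-- COMMIT with SQLSTATE 40001 (serialization_failure); the application should
-- then retry the whole transaction.
```
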

My Note: SQL, normalization forms, etc., are not described in the text; the reader is assumed to be familiar with these concepts.