
Data structures that power databases
- Storing data in a plain append-only file (e.g. lines of JSON) makes writes cheap, but reading a value back is costly: every lookup scans the whole file, O(n)
- We need to use indexes, but they make writing slower.
- Trade-off in storage systems: well-chosen indexes speed up read queries, but every index slows down writes, so databases don't index everything by default; you, the application developer, choose the indexes.
- Hash indexes
- Key-value store (similar to a dictionary type)
- The simplest possible indexing strategy is to keep an in-memory hash map where every key is mapped to a byte offset in the data file, i.e. the location at which the value can be found (Bitcask does this)
- To avoid running out of disk space, break the log into segments of a certain size: close a segment file when it reaches that size, and make subsequent writes to a new segment file. We can then perform compaction on these segments.
- Compaction means throwing away duplicate keys in the key-value update log, retaining only the most recent value for each key. It can be done in a background thread.
- If a crash happens, we can restore each segment's hash map by re-reading the entire segment file from beginning to end (Bitcask speeds this up by storing a snapshot of each hash map on disk).
- Hash tables have limitations → the whole map must fit in memory, and range queries are not efficient
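The hash-index idea above can be sketched in a few lines of Python. This is a toy illustration with made-up names and a naive `key,value\n` record format, not Bitcask's actual on-disk layout; it assumes keys and values contain no commas or newlines:

```python
import os
import tempfile

class HashIndexStore:
    """Append-only log file plus an in-memory hash map of
    key -> byte offset of that key's most recent record."""

    def __init__(self, path):
        self.path = path
        self.index = {}                  # key -> byte offset of latest value
        self.f = open(path, "a+b")       # creates the file if missing

    def put(self, key, value):
        self.f.seek(0, os.SEEK_END)
        offset = self.f.tell()           # record starts at the end of the log
        self.f.write(f"{key},{value}\n".encode())
        self.f.flush()
        self.index[key] = offset         # overwrite: only the newest offset kept

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        self.f.seek(offset)              # jump straight to the record: O(1)
        line = self.f.readline().decode().rstrip("\n")
        _, _, value = line.partition(",")
        return value

    def compact(self):
        """Rewrite the log, retaining only the latest value per key."""
        latest = {k: self.get(k) for k in self.index}
        self.f.close()
        self.f = open(self.path, "w+b")  # truncate the old log
        self.index = {}
        for k, v in latest.items():
            self.put(k, v)

path = os.path.join(tempfile.mkdtemp(), "segment0.log")
store = HashIndexStore(path)
store.put("apple", "5")
store.put("apple", "7")                  # newer value; index now points here
store.put("banana", "3")
print(store.get("apple"))                # latest value wins: 7
```

Note that a real implementation splits the log into multiple segment files and compacts them in a background thread, as described above; here compaction just rewrites a single file in place.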
- SSTables and LSM-Trees
- We require that the sequence of key-value pairs is sorted by key (Sorted String Table, or SSTable). With this format, we can no longer simply append a new key-value pair to the segment, since the order must be maintained.
- We merge several SSTable segments into one, retaining only the most recent value
- They’re basically files that store key-value pairs in sorted order. Because they’re sorted and written only once, reading and merging data is fast and predictable.
- In order to find a key, you no longer need to keep an index of all keys in memory: a sparse index is enough. You jump to the offset of the closest key that precedes it alphabetically and scan from there.
- How the storage engine works:
    - When a write comes in, add it to an in-memory balanced tree data structure (e.g. a red-black tree), called the memtable
    - When the memtable gets bigger than some threshold (a few MB), write it out to disk as an SSTable file.
    - To serve a read request, first try to find the key in the memtable, then in the most recent on-disk segment, then in the next-older segment, etc.
    - From time to time, run a merging and compaction process in the background
    - It works well, but it has one problem: if the DB crashes, the most recent writes (still in the memtable) are lost. We can keep a separate append-only log on disk to which every write is immediately appended, and discard it once the memtable has been written out to an SSTable.
- This is used in LevelDB and RocksDB, key-value storage engine libraries.
- Patrick O'Neil et al. described this indexing structure under the name Log-Structured Merge-Tree (LSM-Tree)
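The storage-engine steps above can be sketched as a toy LSM-tree. This is a simplified illustration with hypothetical names: a plain dict stands in for the red-black-tree memtable, and in-memory sorted lists stand in for on-disk SSTable segments:

```python
import bisect

class MiniLSM:
    """Toy LSM-tree: dict as memtable, sorted (key, value) lists as segments."""

    def __init__(self, memtable_limit=4):
        self.memtable_limit = memtable_limit
        self.memtable = {}
        self.segments = []                    # newest segment last

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                      # memtable too big -> SSTable

    def flush(self):
        # Write the memtable out as a sorted segment (an SSTable).
        self.segments.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        # Read path: memtable first, then segments from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for seg in reversed(self.segments):
            keys = [k for k, _ in seg]
            i = bisect.bisect_left(keys, key)  # binary search: keys are sorted
            if i < len(keys) and keys[i] == key:
                return seg[i][1]
        return None

    def compact(self):
        # Merge all segments, retaining only the newest value for each key.
        merged = {}
        for seg in self.segments:             # oldest to newest
            merged.update(dict(seg))
        self.segments = [sorted(merged.items())]

db = MiniLSM(memtable_limit=2)
db.put("b", 1)
db.put("a", 2)          # memtable full -> flushed as a sorted segment
db.put("b", 3)
print(db.get("b"))      # 3, found in the memtable
print(db.get("a"))      # 2, found by binary search in the segment
```

A real engine would also keep the sparse in-memory index per segment and the crash-recovery write-ahead log described above; both are omitted here for brevity.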

- B-Trees
- They also keep key-value pairs sorted by key
- B-trees break the DB into fixed-size blocks or pages (traditionally 4 KB in size) and read or write one page at a time.
- Each page can be identified using an address or location, which allows one page to refer to another (similar to a pointer, but on disk instead of in memory)
- The number of references to child pages in one page is called the branching factor.
- The algorithm ensures that the tree remains balanced → a B-tree with n keys always has a depth of O(log n).
- Most databases can fit into a B-tree that is 3-4 levels deep
- To make the DB resilient to crashes, it is common to include an additional data structure on disk: a write-ahead log (WAL, or redo log), an append-only file to which every B-tree modification must be written before it can be applied to the pages of the tree itself.
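The claim that a three- or four-level B-tree fits most databases follows from simple arithmetic. A quick sketch, assuming 4 KB pages and a branching factor of 500 (both plausible but illustrative figures):

```python
branching_factor = 500        # child-page references per page
page_size = 4 * 1024          # 4 KB pages
levels = 4

# With each level multiplying fan-out by the branching factor,
# a 4-level tree addresses branching_factor**levels leaf pages.
leaf_pages = branching_factor ** levels
capacity_bytes = leaf_pages * page_size

print(capacity_bytes // 10**12)   # roughly 256 TB of addressable data
```

That capacity (about 256 TB) comfortably exceeds most databases, which is why B-trees stay so shallow in practice.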

- Comparing B-Trees and LSM-trees
- LSM-trees are typically faster for writes, while B-trees are thought to be faster for reads.
- LSM-trees can be compressed better, producing smaller files on disk
- The downside of LSM-trees is that the compaction process can interfere with the performance of ongoing reads and writes
- Another issue with compaction arises at high write throughput → the disk's finite write bandwidth must be shared between the initial write (logging and flushing the memtable) and the compaction threads running in the background
- The indexes discussed so far map primary keys to values
- Other indexing structures
- secondary index
- heap files
- clustered index
- concatenated index
- In-memory databases (VoltDB, MemSQL, …)
Transaction Processing or Analytics
- OLAP vs OLTP
- SQL is good for both
- Data warehousing → a separate DB that analysts can query without affecting OLTP operations. Contains a read-only copy of the data from all the various OLTP systems, loaded via ETL (Extract-Transform-Load).
- Star and snowflake schemas → schemas for analytics. In a snowflake schema, dimensions are further broken down into subdimensions.

Column-Oriented Storage
- Column-oriented stores don't store all the values from one row together; instead, they store all the values from each column together.