
- Processing every event as it happens.
- A stream is data that is incrementally made available over time, e.g., Unix stdin and stdout, filesystem APIs (Java's FileInputStream), etc.
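A minimal sketch of the idea in Python: a program that consumes stdin handles each record as soon as it arrives, without waiting for the whole input to exist.

```python
import sys

# Consume a stream incrementally: each line is processed as soon as it
# arrives, without waiting for the producer to finish writing.
for line in sys.stdin:
    record = line.rstrip("\n")
    print(f"processed: {record}")
```

Piped behind a live source (e.g., saved as a hypothetical `consume.py` and run as `tail -f app.log | python consume.py`), it keeps producing output as new lines appear.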
- Transmitting Event Streams
- When the input is a file, the first processing step is to parse it into a sequence of records.
- In streaming terminology, a record is known as an event; it typically contains a timestamp indicating when it happened according to a time-of-day clock.
- The event can be encoded as a text string, JSON, or some binary form.
- A database or file can be sufficient to connect producers and consumers: producers write events to the datastore, and consumers periodically poll it for new data.
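A small sketch of what an event record might look like, encoded as one JSON text line; the field names here are illustrative assumptions, not a standard schema.

```python
import json
import time

# An event is a record with a timestamp; "type", "timestamp", and
# "payload" are illustrative field names, not a fixed schema.
def make_event(event_type, payload):
    return {
        "type": event_type,
        "timestamp": time.time(),  # time-of-day clock, seconds since epoch
        "payload": payload,
    }

# Encode the event as a JSON text line, then parse it back into a record.
encoded = json.dumps(make_event("page_view", {"url": "/home"}))
event = json.loads(encoded)
print(event["type"], event["timestamp"])
```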
- Messaging Systems
- Messaging systems are a common approach for notifying consumers about new events.
- With the pub/sub model, different systems take a wide range of approaches. To compare them, we need to ask two questions:
- What happens if the producer sends messages faster than the consumer can process them? (The system can drop messages, buffer them, or apply backpressure/flow control; see the sketch below.)
- What happens if a node crashes or temporarily goes offline? Are messages lost?
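A sketch of those options using Python's standard `queue.Queue`: a bounded buffer blocks the producer when full (backpressure), whereas `put_nowait` would raise `queue.Full`, i.e., drop the message.

```python
import queue
import threading
import time

# A bounded buffer between producer and consumer. When it is full,
# q.put(...) blocks the producer: that is backpressure. Using
# q.put_nowait(...) instead would raise queue.Full, i.e., drop the message.
q = queue.Queue(maxsize=10)

def producer():
    for i in range(100):
        q.put(f"msg-{i}")  # blocks whenever the consumer falls behind

def consumer():
    while True:
        q.get()
        time.sleep(0.01)  # simulate a slow consumer
        q.task_done()

threading.Thread(target=consumer, daemon=True).start()
producer()
q.join()  # wait until every message has been processed
```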
- Direct messaging from producers to consumers
- Many messaging systems use direct network communication between producers and consumers: UDP multicast (common in the financial industry), brokerless libraries like ZeroMQ, or metrics collectors such as StatsD and Brubeck.
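A minimal pyzmq sketch of the brokerless model: publisher and subscriber talk directly over TCP with nothing in between. Running both sockets in one process is purely for illustration.

```python
import time
import zmq  # pip install pyzmq

ctx = zmq.Context()

# Publisher: sends messages straight to connected subscribers, no broker.
pub = ctx.socket(zmq.PUB)
pub.bind("tcp://*:5556")

# Subscriber: would normally run in a separate process or machine.
sub = ctx.socket(zmq.SUB)
sub.connect("tcp://localhost:5556")
sub.setsockopt_string(zmq.SUBSCRIBE, "metrics")  # prefix-based topic filter

time.sleep(0.2)  # let the subscription propagate (PUB/SUB "slow joiner")
pub.send_string("metrics cpu=0.42")
print(sub.recv_string())  # metrics cpu=0.42
```

Note the trade-off this style implies: if a subscriber is offline, messages published in the meantime are simply lost.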
- Message brokers. The alternative is to send messages via message brokers (a queue). Producers write messages to the broker, and consumers receive them by reading from the broker.
- Databases typically store data until it is explicitly deleted, while brokers automatically delete a message once it has been delivered to its consumers.
- Examples include RabbitMQ, ActiveMQ, and Azure Service Bus, among others.
- Message brokers use acknowledgments to ensure that messages are not lost: the client must notify the broker when it has finished processing a message so that the broker can remove it from the queue.
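A toy in-memory broker to make the acknowledgment idea concrete (an illustrative sketch, nothing like a real broker's implementation): a delivered message stays "in flight" until the consumer acks it, so it can be redelivered after a crash.

```python
import queue

class TinyBroker:
    """Toy in-memory broker: a delivered message stays 'in flight' until
    the consumer acknowledges it, so unacked messages can be redelivered."""

    def __init__(self):
        self._queue = queue.Queue()
        self._in_flight = {}  # delivery_id -> message
        self._next_id = 0

    def publish(self, message):
        self._queue.put(message)

    def deliver(self):
        message = self._queue.get()
        self._next_id += 1
        self._in_flight[self._next_id] = message
        return self._next_id, message

    def ack(self, delivery_id):
        # Only now is the message actually removed from the broker.
        del self._in_flight[delivery_id]

    def requeue_unacked(self):
        # Called when a consumer crashes: put unacked messages back.
        for message in self._in_flight.values():
            self._queue.put(message)
        self._in_flight.clear()

broker = TinyBroker()
broker.publish("order-created")
delivery_id, msg = broker.deliver()
broker.ack(delivery_id)  # skip this, and requeue_unacked() redelivers
```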

- Partitioned Logs
- A log is an append-only sequence of records on disk.
- A log can be partitioned, with the partitions stored on different machines.
- A topic can be defined as a group of partitions that all carry messages of the same type.
- Apache Kafka and Amazon Kinesis are log-based message brokers that operate similarly.
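A toy sketch of a partitioned log, assuming hash-of-key partitioning so that messages with the same key always land in the same partition (Kafka's default partitioner behaves similarly, but this is not Kafka's actual implementation):

```python
import hashlib

class PartitionedLog:
    """Toy log-based topic: each partition is an append-only list, and a
    message's key determines which partition it goes to."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        # Hash the key to pick a partition deterministically.
        p = int(hashlib.md5(key.encode()).hexdigest(), 16) % len(self.partitions)
        self.partitions[p].append(value)
        return p, len(self.partitions[p]) - 1  # (partition, offset)

    def read(self, partition, offset):
        # Consumers track their own offset per partition and read onward.
        return self.partitions[partition][offset:]

log = PartitionedLog(num_partitions=4)
partition, offset = log.append("user-42", "clicked-checkout")
print(log.read(partition, offset))  # ['clicked-checkout']
```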
- Databases and Streams
- The replication log is a stream of database write events.
- To keep several systems in sync, we can use dual writes; however, race conditions may occur, and one write may fail while the other succeeds, leaving the two systems inconsistent (see the sketch below).
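A sketch of the partial-failure case: the application writes the same change to two systems with no transaction spanning both, so a failed second write leaves them diverged.

```python
# The application writes the same change to two systems, with no
# transaction spanning both. If the second write fails, the stores
# silently diverge.
database = {}
search_index = {}

def update_user(user_id, name, index_is_up=True):
    database[user_id] = name       # write 1 succeeds
    if not index_is_up:
        raise ConnectionError("search index unavailable")
    search_index[user_id] = name   # write 2 never happens on failure

try:
    update_user("u1", "Alice", index_is_up=False)
except ConnectionError:
    pass  # a real app would need to retry or repair, but how?

print(database)      # {'u1': 'Alice'}
print(search_index)  # {}  (inconsistent with the database)
```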
- Change Data Capture (CDC) is the process of observing all data changes written to a database and extracting them in a form in which they can be replicated to other systems. It becomes especially interesting when changes are made available as a stream.
- For example, you can capture the changes in a database and continuously apply the same changes to a search index.
- CDC makes one database the leader and turns the others into followers.
- We can implement CDC with database triggers by registering triggers that observe all changes to data tables. This is viable but not performant. There are other options, such as LinkedIn's Databus and Facebook's Wormhole, among others.
- Log compaction is a good alternative to keeping only a limited amount of log history: we retain just the most recent update for each key (see the sketch below).
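A sketch of log compaction over a list of `(key, value)` entries: only the last update for each key survives.

```python
def compact(log):
    """Keep only the most recent update for each key, preserving the
    order of the surviving entries."""
    latest = {}  # key -> index of its last occurrence
    for i, (key, _value) in enumerate(log):
        latest[key] = i
    return [entry for i, entry in enumerate(log) if latest[entry[0]] == i]

log = [("user:1", "Alice"), ("user:2", "Bob"), ("user:1", "Alicia")]
print(compact(log))  # [('user:2', 'Bob'), ('user:1', 'Alicia')]
```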

- Event Sourcing (now popular). Like CDC, it stores all changes to the application state as a log of change events. The difference: with CDC the application uses the database in a mutable way, updating and deleting records, whereas in event sourcing the application logic is built on the immutability of events in an event log. Usually it is not the full history that is interesting but the current state, so applications need to take the log of events and transform it into application state suitable for showing to the user.
- Here we distinguish between events and commands. When a request arrives, it starts out as a command, which may still fail (e.g., validation). Once the command is validated and executed successfully, it becomes an immutable event (see the sketch below).
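A minimal event-sourcing sketch (names like `handle_command` are illustrative, not from the book): commands are validated, successful ones are appended as past-tense events, and the current state is derived by folding the log.

```python
events = []  # the append-only, immutable event log

def current_state():
    # Derive state by replaying (folding) the whole event log.
    balance = 0
    for event in events:
        if event["type"] == "deposited":
            balance += event["amount"]
        elif event["type"] == "withdrawn":
            balance -= event["amount"]
    return {"balance": balance}

def handle_command(command):
    # A command may fail validation; only successful commands are
    # recorded as past-tense events.
    if command["type"] == "withdraw":
        if command["amount"] > current_state()["balance"]:
            raise ValueError("insufficient funds")  # rejected, no event
        events.append({"type": "withdrawn", "amount": command["amount"]})
    elif command["type"] == "deposit":
        events.append({"type": "deposited", "amount": command["amount"]})

handle_command({"type": "deposit", "amount": 100})
handle_command({"type": "withdraw", "amount": 30})
print(current_state())  # {'balance': 70}
```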
- Processing Streams. There are three options:
- Take data from the events and write it into a DB, cache, search index, or similar storage system, from where it can be queried.
- Push the events to the user in some way (e.g., send an email or a push notification).
- Process one or more input streams to produce one or more output streams. This is closely related to Unix processes and MapReduce jobs.
- This third option is used, for example, in fraud detection and trading systems (a minimal sketch follows this list).
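A sketch of the third option as a generator pipeline, in the spirit of Unix pipes; the fraud rule here is made up for illustration.

```python
# One operator in a pipeline: consume an input stream of payment events,
# produce an output stream of alerts. The threshold rule is invented.
def suspicious(payments):
    for payment in payments:
        if payment["amount"] > 10_000:
            yield {**payment, "flag": "review"}

payments = iter([
    {"id": 1, "amount": 25},
    {"id": 2, "amount": 50_000},
])

for alert in suspicious(payments):
    print(alert)  # {'id': 2, 'amount': 50000, 'flag': 'review'}
```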
My Note: The book is outdated. For example, Kafka is barely mentioned, and neither are modern data and ETL solutions (everything after 2019 is missing). There are no details about any specific system; it is mostly the theory behind them, though every concept is covered in depth. For whom is this book useful? Additionally, it jumps from database implementation details to distributed systems, where communication happens over a network.