
- Offline systems
- No user is waiting interactively for the job to finish (jobs can take minutes to days)
- Usually scheduled to run periodically
- The primary performance measure is throughput (how long it takes to process a dataset of a given size)
- Batch processing with Unix tools: e.g., analyzing web server logs, either with a chain of Unix commands or with a short Ruby script
- Describes the Unix philosophy: small single-purpose tools with a uniform text interface, composed with pipes and sorting
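A minimal sketch of such a pipeline (the sample requests file and field positions are invented for illustration; a real log-analysis pipeline would read your server's access log):

```shell
# Hypothetical sample data -- two fields per line: method and URL.
printf '%s\n' 'GET /home' 'GET /home' 'GET /about' > requests.txt

# Find the most requested URLs: each tool does one small job,
# and pipes compose them (the Unix philosophy).
cat requests.txt |
  awk '{print $2}' |   # extract the URL field
  sort |               # bring identical URLs together
  uniq -c |            # collapse adjacent duplicates into counts
  sort -r -n |         # order by count, descending
  head -n 5            # keep the top five
```

The intermediate `sort` is what makes `uniq -c` work: it only merges *adjacent* duplicate lines.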
- MapReduce and Distributed Filesystems
- MapReduce is similar in spirit to Unix tools, but distributed across potentially thousands of machines
- It reads and writes files on a distributed filesystem
- In the Hadoop implementation, the distributed filesystem is called HDFS
- File blocks are replicated across machines for fault tolerance, a technique conceptually similar to RAID
- How a MapReduce job works
- Read a set of files, and break them into records
- Call the mapper function to extract a key and a value from each input record
- Sort all of the key-value pairs by key
- Call the reducer function to iterate over the sorted pairs, processing all values belonging to the same key together
- These jobs are often paired in a workflow
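The four steps above can be sketched as a single-machine simulation. This is a hypothetical `run_mapreduce` helper for illustration, not Hadoop's actual API:

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, mapper, reducer):
    """Single-machine sketch of the four MapReduce steps."""
    # Steps 1-2: treat each input as a record and call the mapper on it,
    # collecting all emitted (key, value) pairs.
    pairs = [kv for record in records for kv in mapper(record)]
    # Step 3: sort the pairs by key so that equal keys become adjacent.
    pairs.sort(key=itemgetter(0))
    # Step 4: call the reducer once per key with that key's values.
    return {key: reducer(key, [v for _, v in group])
            for key, group in groupby(pairs, key=itemgetter(0))}

# Word count: the mapper emits (word, 1); the reducer sums the ones.
counts = run_mapreduce(
    ["the cat", "the dog"],
    mapper=lambda line: [(word, 1) for word in line.split()],
    reducer=lambda key, values: sum(values),
)
# counts == {"cat": 1, "dog": 1, "the": 2}
```

In a real cluster the mapper and reducer calls run in parallel on many machines, and the sort happens in stages (partitioning, shuffling, and merging), but the contract of the two callback functions is the same.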
