Note: This post is a work in progress. Suggestions and corrections are welcome.
Distributed systems span a wide range of concepts and keep evolving, which makes them hard to learn in a unified way. The field's rapid research output scatters essential material across many sources, so traditional learning paths often leave gaps. The list below groups core topic areas, research papers, and structured resources to help build a solid foundation.
Core Topics
- Time and Event Ordering (physical vs logical clocks, Lamport clocks, vector clocks)
- System and Failure Models; Impossibility and Trade-Offs (crash vs Byzantine, FLP, CAP)
- Communication and Broadcast (reliable, causal, total-order/atomic, gossip dissemination)
- Replication and Consensus (state machine replication, Paxos family, Raft, Viewstamped Replication, Chain Replication)
- Quorum and Consistency Models (quorum reads/writes, linearizability, sequential, causal, eventual)
- Data Partitioning and Placement (sharding, consistent hashing, range partitioning)
- Transactions and Concurrency Control (2PC/3PC, MVCC, deterministic/Calvin-style, timestamp ordering)
- Conflict Resolution and CRDTs (operation-based, state-based, convergence principles)
- Membership and Failure Detection (heartbeats, accrual detectors, SWIM)
- Caching and Latency Engineering (TTL strategies, invalidation, hedged/backup requests, tail latency)
- Storage Internals (logs, LSM trees, compaction, snapshotting)
- Scheduling and Resource Orchestration (cluster managers, Borg/Omega/Kubernetes concepts)
- Observability and Debugging (distributed tracing, metrics, structured logging)
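
To make the first topic above a little more concrete, here is a minimal sketch of a Lamport clock. The class and method names (`LamportClock`, `tick`, `send`, `receive`) are illustrative choices, not taken from any library, and the sketch assumes each process keeps a single integer counter and attaches it to every message it sends.

```python
from dataclasses import dataclass


@dataclass
class LamportClock:
    """Logical clock for one process (hypothetical helper; counter starts at 0)."""
    time: int = 0

    def tick(self) -> int:
        # Local event: advance the counter by one.
        self.time += 1
        return self.time

    def send(self) -> int:
        # Sending counts as an event; the returned value is the message timestamp.
        return self.tick()

    def receive(self, msg_time: int) -> int:
        # Move past both the local counter and the sender's timestamp.
        self.time = max(self.time, msg_time) + 1
        return self.time


a, b = LamportClock(), LamportClock()
a.tick()                     # local event on a
stamp = a.send()             # a sends a message carrying its timestamp
b.tick()                     # unrelated local event on b
received = b.receive(stamp)  # b merges the timestamp on receipt
assert stamp < received      # the send is ordered before the receive
```

The rule in `receive` is what ties the clock to causality: if one event can influence another, its timestamp is smaller. The converse does not hold, which is the gap vector clocks address.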
Interesting Research Papers
- MapReduce: Simplified Data Processing on Large Clusters
- The Google File System
- In Search of an Understandable Consensus Algorithm (Raft)
- ZooKeeper: Wait-free coordination for Internet-scale systems
- Spanner: Google’s Globally-Distributed Database
- Dynamo: Amazon’s Highly Available Key-value Store
- Cassandra – A Decentralized Structured Storage System
- Bigtable: A Distributed Storage System for Structured Data
- FaRM: Fast Remote Memory
- Scaling Memcache at Facebook
- Kafka: A Distributed Messaging System for Log Processing
- The Chubby lock service for loosely-coupled distributed systems
- The Tail at Scale