Full curriculum

159 modules. 2424 pages. Open.

Every chapter, ordered by part. Full-length teaching articles — no stubs, no "coming soon".

Modules: 159 · Words: 727,180 · Parts: 12

Part 0

Prerequisites

Networking, OS, data structures, databases, APIs. The foundation.

5 chapters · 2h · Beginner
  1. Networking Fundamentals for System Design
     A practical tour of OSI layers, TCP/UDP, HTTP/1.1 through HTTP/3, DNS, TLS 1.3, and realtime transports like WebSockets and SSE.
     gRPC, WebSockets, Cloudflare · beg · 25 min
  2. Operating System Essentials for System Design
     Processes, threads, memory hierarchy with real latency numbers, I/O models (epoll, kqueue, io_uring), and file system internals you need to design systems.
     Redis, Nginx, Kafka · beg · 20 min
  3. Data Structures for Distributed Systems
     The handful of data structures that power modern infrastructure: hash tables, B-trees, LSM-trees, Bloom filters, skip lists, and consistent hashing rings.
     Redis, Cassandra, ScyllaDB +4 · beg · 25 min
  4. Database Fundamentals for System Design
     SQL, ACID, indexing, query execution, and the normalization vs denormalization debate, covered at the depth you need to design real systems.
     PostgreSQL, MySQL, Redis +1 · beg · 25 min
  5. API Design Basics: REST, GraphQL, gRPC, and the Hard Parts
     Resource modeling, GraphQL's N+1 problem, gRPC streaming, versioning strategies, idempotency keys, and cursor pagination done right.
     gRPC, GraphQL, WebSockets +1 · beg · 25 min
Part 1

Core Fundamentals

Scalability, CAP, estimation, interview framework. The vocabulary.

7 chapters · 3h · Beginner to Intermediate
  1. Scalability: Growing a System Without Breaking It
     Vertical vs horizontal scaling, stateless services, read vs write scaling, and when scaling is the wrong answer. The vocabulary every system design conversation…
     PostgreSQL, MySQL, Redis +5 · beg · 25 min
  2. Latency and Throughput: The Two Numbers That Matter
     Latency vs throughput, Jeff Dean's numbers, tail latency (p50/p95/p99), Little's Law, and how to find the real bottleneck in any system.
     DynamoDB, Prometheus · beg · 30 min
  3. Availability and Reliability: Nines, SLOs, and Staying Up
     Availability math, MTBF/MTTR, redundancy patterns, SLI/SLO/SLA, error budgets, dependency math, and real-world outage lessons.
     DynamoDB, S3, Cloudflare +1 · beg · 25 min
  4. Consistency Models: What Readers Actually See
     Strong, eventual, and causal consistency. Read-your-writes, monotonic reads, and the client-centric vs data-centric distinction that makes consistency tractable…
     DynamoDB, Spanner, CockroachDB +2 · int · 25 min
  5. Back-of-the-Envelope Estimation
     Powers of 2 and 10, storage and bandwidth templates, QPS math, and worked examples for Twitter and YouTube scale. The interview skill that matters in production…
     Redis, Cassandra, ScyllaDB · beg · 25 min
  6. How to Approach a System Design Question
     A repeatable 6-step framework for 45-minute system design interviews: clarify, estimate, API, high-level design, deep dive, trade-offs. Minute-by-minute plan wi…
     beg · 25 min
  7. Trade-off Thinking
     Every design decision is a trade-off. How to articulate, structure, and defend your choices, and how to recognize when not choosing is the right move.
     Cassandra, ScyllaDB · int · 25 min
Part 2

Building Blocks

Load balancers, caches, queues, databases, rate limiters.

16 chapters · 7h · Intermediate
  1. Load Balancers: Spreading Traffic, Absorbing Failure
     L4 vs L7 load balancing, algorithms, health checks, and how Envoy, HAProxy, NGINX, and AWS ALB/NLB actually work in production.
     Nginx, Envoy, HAProxy +2 · int · 25 min
  2. Reverse Proxies and API Gateways: The Smart Edge
     TLS termination, routing, auth, rate limiting, and why an API gateway is more than a smarter load balancer.
     Nginx, Envoy, Kong +2 · int · 30 min
  3. Content Delivery Networks: Moving Bytes Closer to Users
     How CDNs work, edge caching, cache keys and invalidation, and how Cloudflare, Akamai, and Fastly differ in practice.
     Cloudflare, CloudFront, Akamai +1 · int · 30 min
  4. Caching: From Browser to Database
     Cache hierarchy, five write patterns, eviction policies (LRU/TinyLFU), stampede prevention, and why invalidation is the hardest problem in distributed systems.
     Redis, Memcached · int · 25 min
  5. SQL Databases: The Boring Technology That Wins
     B-trees, MVCC, isolation levels, indexing, and why PostgreSQL and MySQL are still the right answer for most systems.
     PostgreSQL, MySQL, CockroachDB +6 · int · 25 min
  6. NoSQL Databases: Picking the Right Non-Relational Tool
     Key-value, document, wide-column, and graph stores, and how DynamoDB, Cassandra, MongoDB, and Neo4j differ in practice.
     DynamoDB, Redis, MongoDB +8 · int · 25 min
  7. Database Partitioning and Sharding: When One Node Is Not Enough
     Range, hash, and consistent-hash partitioning, hot spots, resharding, and how Notion, Figma, and Discord partition data in practice.
     PostgreSQL, MySQL, Cassandra +7 · adv · 25 min
  8. Database Replication: Keeping Copies in Sync
     Leader-follower, multi-leader, and leaderless replication, sync vs async, failover, and how Postgres, MySQL, Cassandra, and DynamoDB differ.
     PostgreSQL, MySQL, Cassandra +6 · adv · 25 min
  9. Message Queues and Streaming: Decoupling at Scale
     Queues vs logs, Kafka vs RabbitMQ vs SQS, delivery semantics, partitioning, consumer groups, and when streaming beats request-response.
     Kafka, RabbitMQ, SQS +4 · adv · 25 min
  10. Pub/Sub: Fan-Out and Event-Driven Systems
      Topics, subscriptions, fan-out strategies, and how Google Pub/Sub, Redis, NATS, and SNS+SQS implement publish/subscribe at scale.
      Kafka, Redis, NATS +1 · int · 25 min
  11. Real-Time Communication: WebSockets, SSE, and Long Polling
      Long polling, Server-Sent Events, WebSockets, WebRTC, and MQTT: when to use each, how to scale persistent connections, and production lessons from Discord and S…
      WebSockets · int · 25 min
  12. Rate Limiting: Protecting Systems from Themselves
      Token bucket, leaky bucket, fixed and sliding windows, distributed rate limiting, and how Stripe, Cloudflare, and GitHub protect their APIs.
      Redis, Cloudflare, Envoy · int · 25 min
  13. Service Discovery and Service Mesh: Finding and Talking to Services
      DNS, client-side and server-side discovery, health checks, sidecars, mTLS, and what Consul, Envoy, Istio, and Linkerd actually solve.
      Envoy, Istio, Consul +3 · adv · 25 min
  14. Blob and Object Storage: Storing the Big Stuff
      S3 semantics, object storage internals, multipart uploads, lifecycle policies, and when to pick S3 vs GCS vs MinIO vs a filesystem.
      S3, Cloudflare, CloudFront · int · 25 min
  15. Geospatial Indexing: Geohash, Quadtree, R-tree, S2, and H3
      Space-filling curves, hierarchical grids, and tree indexes for location queries - with decision guidance on geohash vs quadtree vs R-tree vs S2 vs H3.
      Redis, PostgreSQL, Elasticsearch · int · 25 min
  16. Edge Computing (Cloudflare Workers, Lambda@Edge, Deno Deploy)
      Design applications for the edge: cold starts, state replication with Durable Objects, edge databases, and the limits of running code close to users.
      int · 25 min
Part 3

Distributed Systems Theory

Consensus, CRDTs, clocks, consistent hashing. The theory.

11 chapters · 5h · Intermediate to Advanced
  1. Consensus Protocols: How Distributed Systems Agree
     Raft and Multi-Paxos explained: leader election, log replication, safety under term numbers, and why etcd, Consul, and CockroachDB picked Raft.
     etcd, CockroachDB, ZooKeeper +1 · int · 30 min
  2. Consistency Deep Dive: Linearizability, Serializability, and the Spectrum Between
     Linearizability vs serializability vs causal vs eventual, external consistency, and how to reason precisely about what your database actually gives you.
     Spanner, CockroachDB, PostgreSQL +2 · adv · 25 min
  3. Quorums and Replication: The Math of R + W > N
     Read and write quorums, Dynamo-style replication, sloppy quorums, hinted handoff, and when quorums give you linearizability (and when they do not).
     Cassandra, DynamoDB, ScyllaDB · int · 25 min
  4. CAP and PACELC: The Tradeoff That Keeps Confusing People
     What CAP actually says (and what it doesn't), the three big misconceptions, and how PACELC fixes the omission of latency.
     Cassandra, DynamoDB, MongoDB +5 · int · 25 min
  5. Clocks and Ordering: Lamport, Vector, and Hybrid Logical Clocks
     Why wall-clock time lies, happens-before, Lamport clocks, vector clocks, hybrid logical clocks, and Google TrueTime.
     Spanner, CockroachDB, Cassandra +2 · adv · 30 min
  6. CRDTs: Conflict-Free Replicated Data Types
     State-based (CvRDT) and op-based (CmRDT) CRDTs, G-Counter, PN-Counter, OR-Set, LWW-Register, and how Yjs, Automerge, Figma, and Redis use them.
     Redis · adv · 25 min
  7. Distributed Transactions: 2PC, Saga, and When to Avoid Both
     Two-phase commit, Percolator, Sagas, outbox pattern, and the honest answer to distributed transactions: don't, or pay the price.
     Spanner, CockroachDB, Kafka +3 · adv · 30 min
  8. Idempotency and Exactly-Once: The Honest Truth About Delivery Guarantees
     Why exactly-once delivery is a myth, how idempotency keys make at-least-once feel exactly-once, and how Stripe and Kafka implement it.
     Kafka, Redis, SQS +1 · int · 25 min
  9. Failure Detection: Deciding a Node Is Dead
     Heartbeats, phi-accrual detectors, gossip (SWIM), and why 'is it dead or just slow?' has no correct answer.
     Cassandra, Consul, Redis +3 · int · 25 min
  10. Consistent Hashing: Keys to Nodes Without Global Reshuffles
      The hash ring, virtual nodes, bounded-load variant, rendezvous hashing, Maglev, and jump hash for distributing keys across dynamic node sets.
      Cassandra, Memcached, Envoy +3 · int · 25 min
  11. Merkle Trees and Anti-Entropy: Keeping Replicas in Sync Cheaply
      Merkle trees, anti-entropy protocols, read repair, hinted handoff, and how Dynamo, Cassandra, and Git use hashed trees to find the differences.
      Cassandra, DynamoDB · adv · 25 min
Part 4

Data Systems

Storage engines, OLAP, streams, search, vectors.

10 chapters · 4h · Intermediate to Advanced
  1. Storage Engines: B-Trees, LSM-Trees, and Why Your Database Feels the Way It Does
     How B-tree and LSM-tree storage engines shape read, write, and space amplification, with examples from InnoDB, PostgreSQL, RocksDB, and Cassandra.
     MySQL, PostgreSQL, Cassandra +3 · adv · 25 min
  2. OLTP vs OLAP: Row Stores, Column Stores, and Matching Shape to Workload
     Why transactional systems use row-oriented storage and analytical systems use columnar, with examples from Postgres, MySQL, Redshift, BigQuery, ClickHouse, and …
     PostgreSQL, MySQL, BigQuery +6 · int · 25 min
  3. Data Warehouses and Data Lakes: Structure, Schema, and the Lakehouse
     How Redshift, BigQuery, Snowflake, S3-based lakes, and the lakehouse pattern with Delta Lake, Iceberg, and Hudi actually fit together.
     BigQuery, S3, Spark +6 · int · 25 min
  4. Stream vs Batch Processing: Lambda, Kappa, and the End of That Debate
     Batch with Spark and Hadoop, streaming with Kafka Streams, Flink, and Spark Streaming, and how Lambda and Kappa architectures stack up.
     Kafka, Flink, Spark +2 · int · 25 min
  5. Change Data Capture: Streaming the Database's Inner Monologue
     How Debezium, Maxwell, and the outbox pattern turn WAL and binlog entries into reliable event streams, and when each approach is the right call.
     PostgreSQL, MySQL, Kafka +2 · int · 25 min
  6. Search Systems: Inverted Indexes, BM25, and Running Elasticsearch in Production
     How Elasticsearch, OpenSearch, and Solr build inverted indexes, score with BM25, and handle faceting, relevance tuning, and sharding at scale.
     Elasticsearch, OpenSearch, PostgreSQL +1 · int · 30 min
  7. Time-Series Databases: Metrics, Events, and Retention at Scale
     How Prometheus, InfluxDB, TimescaleDB, and VictoriaMetrics handle write-heavy time-series workloads with downsampling and retention policies.
     Prometheus, Grafana, InfluxDB +2 · int · 25 min
  8. Graph Databases: Property Graphs, Cypher, and When Joins Are the Problem
     How Neo4j, Amazon Neptune, and Dgraph model relationships, and when graph queries beat recursive SQL joins.
     Neo4j, RocksDB · int · 25 min
  9. Vector Databases: Embeddings, ANN Indexes, and the Retrieval Layer for AI
     How Pinecone, Weaviate, Milvus, and pgvector store and search embeddings using HNSW and IVF approximate nearest neighbor indexes.
     Pinecone, Weaviate, Milvus +6 · adv · 25 min
  10. Key-Value Stores: Redis, Memcached, DynamoDB, and Picking the Right Hash Table
      How Redis, Memcached, and DynamoDB differ in durability, data model, and scaling, and when each is the right key-value store.
      Redis, Memcached, DynamoDB +1 · int · 25 min
Part 5

Architecture Patterns

Microservices, event-driven, CQRS, multi-region.

11 chapters · 4h · Intermediate
  1. Monolith vs Microservices: Team Topology, Conway's Law, and the Distributed System Tax
     When a modular monolith beats microservices, how Conway's Law shapes architecture, and what the distributed system tax actually costs you.
     int · 25 min
  2. Event-Driven Architecture: Notifications, State Transfer, and Choreography
     The three flavors of events, how Kafka and event buses enable loose coupling, and when choreography beats orchestration.
     Kafka, Flink, Debezium +1 · int · 25 min
  3. CQRS: Separating Reads from Writes Without Losing Your Mind
     Command Query Responsibility Segregation in practice: when to split read and write models, how to handle eventual consistency, and when CQRS is overkill.
     Kafka, PostgreSQL, Elasticsearch +6 · int · 20 min
  4. Event Sourcing: Events as the Source of Truth
     Storing state as an append-only log of events, with replay, projections, snapshots, and the ops reality of running event-sourced systems.
     PostgreSQL, Kafka, DynamoDB +1 · adv · 25 min
  5. Serverless: Functions, Cold Starts, and When FaaS Actually Saves Money
     AWS Lambda, Google Cloud Functions, and Azure Functions in practice: cold starts, concurrency models, and the honest economics of serverless.
     DynamoDB, S3, Cloudflare +1 · int · 20 min
  6. Backend for Frontend: Per-Client API Aggregation Done Right
     When one API cannot serve web, mobile, and partner clients well, the BFF pattern gives each client its own aggregation layer.
     GraphQL · int · 25 min
  7. Strangler Fig: Incremental Migration Without a Big Bang
     Martin Fowler's strangler fig pattern for replacing legacy systems incrementally, with routing, facades, and how teams actually execute multi-year migrations.
     Kafka, Nginx, Envoy +2 · int · 30 min
  8. Hexagonal and Clean Architecture: Keeping Business Logic Independent
     Ports and adapters, clean architecture, and onion architecture: how to keep domain logic testable and framework-independent.
     int · 20 min
  9. Multi-Region Architecture: Active-Passive, Active-Active, and CRDTs
     Designing systems that survive regional failure: DNS failover, active-passive replication, active-active with CRDTs, and Cloudflare's model.
     Cassandra, DynamoDB, Spanner +3 · adv · 25 min
  10. Multi-Tenancy: Silo, Pool, and the SaaS Isolation Spectrum
      Designing SaaS platforms that host many tenants on shared infrastructure: isolation levels, noisy-neighbor defenses, per-tenant metering, and when to graduate a…
      PostgreSQL, MySQL, Redis +1 · int · 25 min
  11. CRDT Applications (Yjs, Automerge, Local-First Software)
      Design local-first collaborative software with CRDTs: Yjs, Automerge, peer-to-peer sync, and the architectural shift away from authoritative central servers.
      adv · 25 min
Part 6

Reliability and Operations

Observability, SLOs, chaos, deployment strategies.

11 chapters · 5h · Intermediate to Advanced
  1. Observability: Metrics, Logs, Traces, and the OpenTelemetry Standard
     The three pillars of observability, USE vs RED methods, and how OpenTelemetry, Prometheus, Grafana, and Jaeger fit together in production.
     Prometheus, Grafana, Jaeger +1 · adv · 25 min
  2. SLI, SLO, SLA, and Error Budgets: Making Reliability Quantitative
     The Google SRE framework for reliability: what to measure, what to target, what to promise, and how error budgets fund feature work vs reliability work.
     Prometheus, Grafana · int · 25 min
  3. Resilience Patterns: Timeouts, Retries, Circuit Breakers, and Bulkheads
     The defensive patterns that keep distributed systems from cascading into total failure, from Hystrix to modern service mesh implementations.
     Envoy, Istio · int · 25 min
  4. Graceful Degradation: When Partial Service Beats No Service
     Load shedding, feature flags, cached fallbacks, and the product-engineering decisions behind degrading one feature to save the system.
     Cloudflare · int · 25 min
  5. Auto-Scaling and Capacity Planning: From HPA to Predictive Scaling
     Horizontal pod autoscalers, cluster autoscalers, predictive scaling, and the capacity planning math that keeps systems sized right without overspending.
     Kafka, Prometheus, Cloudflare · int · 25 min
  6. Deployment Strategies: Blue-Green, Canary, Rolling, and Feature Flags
     How to ship changes safely with blue-green, rolling, canary, and progressive delivery, plus the role of feature flags and LaunchDarkly-style tooling.
     int · 25 min
  7. Chaos Engineering: Breaking Things on Purpose
     Netflix's Chaos Monkey, the Principles of Chaos, and how to run game days and fault injection experiments without making your on-call call in sick.
     Istio · adv · 25 min
  8. Incident Management: From Detection to Blameless Postmortem
     On-call, incident command, severity levels, communication, and how to run postmortems that actually change systems instead of blaming people.
     Cloudflare · int · 25 min
  9. Health Checks and Readiness: Telling the Truth About Whether You're Up
     Liveness, readiness, startup probes, deep vs shallow health checks, and why bad health checks cause more outages than bad code.
     Envoy, Istio, Consul · int · 20 min
  10. Cost Optimization and FinOps
      Apply FinOps to reduce cloud bills without sacrificing reliability: spot instances, reserved capacity, autoscaling, storage tiering, and unit-economics thinking…
      int · 25 min
  11. Platform Engineering: IDPs, Golden Paths, and DX
      Treat the platform as a product: build internal developer platforms with Backstage, golden paths, and DORA/SPACE metrics that move developer productivity.
      adv · 25 min
Part 7

Security at Scale

OAuth2, JWT, mTLS, DDoS.

10 chapters · 4h · Intermediate to Advanced
  1. Authentication vs Authorization: Identity, Permissions, and Access Models
     AuthN vs AuthZ, session vs token auth, and access control models: RBAC, ABAC, ReBAC with examples from SpiceDB, OpenFGA, and AWS IAM.
     Spanner · int · 25 min
  2. OAuth 2.0 and OpenID Connect: Delegated Authorization and Identity Done Right
     The OAuth 2.0 authorization framework, OIDC identity layer, and the Authorization Code + PKCE flow that is the modern standard for web and mobile.
     adv · 25 min
  3. JWT Deep Dive: Signed Tokens, Claims, and the Revocation Problem
     How JSON Web Tokens work: JWS signing, JWE encryption, claim validation, key rotation, and the trade-offs of stateless auth.
     int · 25 min
  4. mTLS and Service-to-Service Authentication: SPIFFE, Service Mesh, and Zero Trust
     How mutual TLS, SPIFFE/SPIRE, and service meshes like Istio and Linkerd authenticate services without long-lived credentials.
     Envoy, Istio, Consul +1 · adv · 25 min
  5. Secrets Management: Vault, KMS, and the End of Secrets in Config Files
     Managing API keys, passwords, and certificates with Vault, AWS Secrets Manager, KMS envelope encryption, and dynamic secrets.
     int · 25 min
  6. DDoS Protection and WAFs: Mitigating Volumetric and Application Attacks
     Defending against L3/L4 and L7 DDoS with Cloudflare, AWS Shield, and WAFs; rate limiting, bot management, and the OWASP Top 10.
     Cloudflare, Akamai, Envoy +1 · int · 25 min
  7. Data Residency and Compliance Architecture (GDPR, DPDP, CCPA, Right-to-Erasure)
     Designing multi-jurisdictional systems for GDPR, DPDP, CCPA, and LGPD with data classification, regional silos, crypto-shredding, and auditable erasure.
     adv · 30 min
  8. Supply Chain Security: SBOM, SLSA, Sigstore, and Defending Against xz-utils
     Protecting the software supply chain with SBOMs, SLSA provenance, Sigstore signing, admission policies, and lessons from xz-utils and SolarWinds.
     adv · 25 min
  9. Privacy-Preserving Systems (Differential Privacy, Federated Learning)
     Design systems that protect user data by construction: differential privacy, federated learning, secure aggregation, and an introduction to homomorphic encrypti…
     adv · 30 min
  10. Post-Quantum Cryptography: Migrating to ML-KEM, ML-DSA, and a Crypto-Agile Future
      Why harvest-now-decrypt-later makes PQC urgent, what NIST standardized in 2024, and how to migrate production TLS and long-lived secrets to hybrid post-quantum …
      adv · 25 min
Part 8

Case Studies

56 end-to-end system designs.

56 chapters · 29h · Intermediate to Advanced
  1. Design a URL Shortener (TinyURL / bit.ly)
     An interview-grade walkthrough for a URL shortener: capacity estimation, short-code generation, hot-key caching, and an analytics pipeline that never blocks the…
     DynamoDB, Redis, Cloudflare +5 · int · 30 min
  2. Design a Pastebin (Paste Sharing Service)
     An interview-grade walkthrough for a Pastebin-style text sharing service: object storage split, TTL-based expiration pipelines, syntax highlighting placement, a…
     S3, DynamoDB, Redis +3 · int · 30 min
  3. Design a Distributed Rate Limiter
     An interview-grade walkthrough for a distributed API rate limiter: algorithm choice, Redis Lua atomicity, two-tier local+global synchronization, and fail-open f…
     Redis, Envoy, Nginx +3 · int · 30 min
  4. Design a Distributed Key-Value Store (Dynamo / Cassandra / Riak)
     Design a distributed KV store with consistent hashing, quorum replication, gossip membership, hinted handoff, and Merkle-tree anti-entropy repair.
     Cassandra, ScyllaDB, DynamoDB · adv · 30 min
  5. Design a Notification System (Push, SMS, Email at Scale)
     An interview-grade walkthrough for a multi-channel notification platform: fan-out architecture, APNs/FCM integration, retry with dead-letter queues, and device-…
     Kafka, Redis, Cassandra +3 · int · 35 min
  6. Design a Chat System (WhatsApp / Messenger / Signal)
     Staff-level design for 1:1 and small-group chat at WhatsApp scale: 500M concurrent connections, message ordering, E2E encryption, and storage model trade-offs.
     ScyllaDB, Cassandra, Redis +7 · adv · 35 min
  7. Design a Social Media Feed (Twitter / Instagram / LinkedIn)
     Fan-out architecture, hybrid push/pull, ML ranking pipelines, and the celebrity problem at Twitter/X scale.
     Redis, Kafka, Cassandra +4 · adv · 35 min
  8. Design a Photo Sharing Service (Instagram)
     Design Instagram-scale photo sharing: upload pipeline, transcoding, multi-resolution image serving, CDN with origin shielding, and news feed integration.
     PostgreSQL, S3, CloudFront +4 · int · 30 min
  9. Design a Web Crawler (Googlebot-style)
     Design a distributed web crawler with URL frontier, politeness policies, content deduplication, robots.txt compliance, and Bloom-filter-backed URL dedup at bill…
     Kafka, Cassandra, HBase +3 · int · 30 min
  10. Design Search Autocomplete (Typeahead Suggestions)
      Design a low-latency autocomplete system with tries, top-K precomputation, real-time trending overlays, and multi-tier caching at Google scale.
      Redis, Elasticsearch, Kafka +3 · int · 30 min
  11. Design a Video Streaming Service (YouTube / Twitch / TikTok)
      Design a UGC video platform from upload through adaptive bitrate streaming: transcode pipeline, HLS/DASH/CMAF packaging, CDN delivery with ISP peering, and live…
      S3, Kafka, CloudFront +3 · int · 35 min
  12. Design Netflix (End-to-End)
      A whole-system walkthrough of Netflix's architecture: microservices, Open Connect CDN, per-title encoding, Cassandra + EVCache, resilience patterns, and chaos e…
      Cassandra, Memcached, Kafka +4 · adv · 35 min
  13. Design a Ride-Hailing Service (Uber / Lyft)
      An interview-grade walkthrough for Uber-scale ride-hailing: H3 geospatial indexing, real-time location ingest, batched bipartite matching, surge pricing, and tr…
      Kafka, Flink, Cassandra +3 · int · 35 min
  14. Design Google Maps (Routing and Tile Rendering)
      Design planet-scale mapping: map tile rendering, shortest-path routing with contraction hierarchies, ETA prediction with GNNs, and offline maps.
      Redis, Kafka, Flink +3 · int · 35 min
  15. Design a File Sync Service (Dropbox / Google Drive)
      Design a Dropbox-style file sync service: block-level deduplication, delta sync, conflict resolution, versioning, and client-server reconciliation.
      MySQL, S3, Kafka +1 · int · 30 min
  16. Design Collaborative Editing (Google Docs / Figma / Notion)
      Staff-level design for real-time collaborative editing at 100K+ concurrent editors: OT vs CRDTs, presence broadcasting, offline sync, and version history.
      WebSockets, PostgreSQL, Redis +1 · adv · 35 min
  17. Design a Distributed Cache (Memcached / Redis Cluster)
      Design a Memcached- or Redis-style distributed cache: consistent hashing, eviction, replication, client-side sharding, and hot-key mitigation.
      Redis, Memcached, Kafka · int · 30 min
  18. Design a Recommendation System (Netflix / YouTube / TikTok)
      Design a two-stage recommendation system: candidate generation, ranking, collaborative filtering, content-based features, a feature store, and cold-start handli…
      Kafka, Faiss, Redis +1 · adv · 30 min
  19. Design a Ticketing System (BookMyShow / Ticketmaster)
      An interview-grade walkthrough for high-concurrency ticketing: seat locking with Redis SETNX, saga-based payment, virtual waiting rooms, and anti-bot defenses a…
      Redis, PostgreSQL, Kafka +5 · int · 30 min
  20. Design a Payment System (Stripe / PayPal)
      Design a payment system with a double-entry ledger, idempotency keys, saga-orchestrated cross-service flows, and the compliance constraints that shape every dec…
      PostgreSQL, Redis, Kafka +3 · adv · 30 min
  21. Design a Stock Exchange (Matching Engine)
      Design a deterministic, low-latency matching engine: FIFO order book, price-time priority, multicast market data distribution, and co-location realities.
      adv · 30 min
  22. Design a Food Delivery Service (DoorDash / Swiggy)
      Design a three-sided food delivery marketplace: dispatch with batching and reassignment, composite ETA prediction, and driver-merchant-customer coordination.
      Kafka, Redis, Cassandra +3 · adv · 30 min
  23. Design a Metrics Pipeline (Prometheus / InfluxDB / Thanos)
      Design a time-series metrics pipeline: high-cardinality ingestion, aggregation across clusters with Thanos or Cortex, alerting, and downsampling for long-term r…
      Prometheus, Grafana, OpenTelemetry +3 · adv · 30 min
  24. Design Ad-Click Aggregation (Real-Time Stream Processing)
      Design an ad-click aggregation system with exactly-once semantics on Kafka + Flink, real-time fraud detection, and low-latency dashboards.
      Kafka, Flink, Redis +6 · adv · 30 min
  25. Design a Logging Platform (ELK / Loki / Splunk)
      Design a logging platform: ingestion at scale, index vs. label-based storage (Elastic vs Loki), retention tiering, and full-text search with BM25.
      Elasticsearch, Kafka, ClickHouse +3 · int · 30 min
  26. Design a Proximity Service (Nearby Friends / Yelp)
      Design a proximity service for 100M users with 1M concurrent location-sharing sessions: geohash/H3/S2 trade-offs, Redis geosets, bounding-box query fan-out, pri…
      Redis, Kafka, PostgreSQL · adv · 30 min
  27. Design a Real-Time Leaderboard
      Design a real-time leaderboard for 10M players with 100K score updates/sec, tie-breaking, time-windowed views, friend boards, and approximate rank for the tail.
      Redis, Kafka, DynamoDB +1 · int · 30 min
  28. Design a Unique ID Generator (Snowflake, ULID, TSID, UUIDv7)
      Design a distributed ID generator producing 10M 64-bit IDs/sec, monotonic-ish ordering, clock-skew resilience, and the four-way trade-off between Snowflake, ULI…
      PostgreSQL, MySQL, ZooKeeper +1 · adv · 30 min
  29. Design a Hotel Reservation System (Booking.com / Airbnb)
      Staff-level design for hotel reservation: search/booking split, PostgreSQL exclusion constraints for date-range double-booking prevention, Temporal saga orchest…
      PostgreSQL, Redis, Elasticsearch +4 · adv · 30 min
  30. Design a Distributed Job Scheduler (Airflow / Temporal / Distributed Cron)
      Design a scheduler for 100k registered jobs and 10k executions/sec with exactly-once execution, DAGs up to 10k nodes, late/missed run policies, and graceful sch…
      Kafka, Cassandra, Redis +4 · adv · 30 min
  31. Design ChatGPT (Conversational AI at Scale)
      Design ChatGPT for 900M weekly users: multi-tenant LLM serving, session-state architecture, streaming SSE, per-user memory, safety, and multi-region deployment.
      PostgreSQL, Redis, Cloudflare +1 · adv · 35 min
  32. Design an Enterprise RAG System
      Design a multi-tenant enterprise RAG platform for 1k tenants with 10M documents each at 100 QPS/tenant: ingestion, hybrid retrieval, reranking, citation, access…
      Pinecone, Weaviate, Kafka +4 · adv · 30 min
  33. Design a Coding Agent (Claude Code / GitHub Copilot / Cursor)
      Design a coding agent serving 1M concurrent sessions across autocomplete, chat, and autonomous loop modes with repo indexing, sandboxed tool use, and streaming …
      Redis, PostgreSQL, gRPC +1 · adv · 30 min
  34. Design Perplexity (AI Search with Citations)
      Design an AI search engine for 50M MAU, 5k QPS peak, <2 s answer latency with inline citations: query rewriting, source retrieval, citation-grounding, streaming…
      Redis, Cloudflare · adv · 30 min
  35. Design a Voice Agent (Alexa / Siri-Class Realtime)
      Design a realtime voice agent for 100M devices with 50k concurrent conversations and sub-700 ms turn latency: streaming ASR, LLM turn-taking, streaming TTS, Web…
      Redis, WebSockets · adv · 30 min
  36. Design a Content Moderation System at Scale
      Design moderation for 500M posts/day with <200 ms pre-publish latency, human-in-loop for 0.5% of traffic, multi-modal (text+image+video): classifier cascade, re…
      Redis, Kafka, ClickHouse · adv · 30 min
  37. Design a Semantic Cache for LLM Applications
      Design an embedding-similarity cache for LLM prompts at 10k QPS, 70%+ hit rate, <10 ms lookup: similarity threshold calibration, invalidation on source change, …
      Redis, Milvus, Kafka +2 · adv · 30 min
  38. Design a Model Router and Gateway (OpenRouter / LiteLLM)
      Design a gateway routing 20k QPS across 50+ models and 10 providers with <30 ms routing overhead: cost/latency/quality routing strategies, provider failover, st…
      Redis, Kafka, ClickHouse +1 · adv · 30 min
  39. Design a Feature Flag Service (LaunchDarkly / Harness FME / Unleash)
      Design a feature flag and experimentation platform for 20T evaluations/day with sub-millisecond SDK-side latency, streaming config distribution, and sub-60s kil…
      PostgreSQL, Kafka, ClickHouse +5 · adv · 30 min
  40. Design a DNS Service (Cloudflare 1.1.1.1 / Google 8.8.8.8)
      Design a public recursive DNS resolver serving trillions of queries/day globally with <20 ms p99 from 300+ anycast POPs: UDP/TCP/DoT/DoH/DoQ, DNSSEC validation,…
      Cloudflare · adv · 30 min
  41. Design a Dating App (Tinder / Hinge / Bumble)
      Design a dating app for 100M MAU handling 1.5B swipes/day with <50 ms card loads: two-tower recommendations, geospatial filtering, mutual-match detection, and s…
      Redis, Kafka, DynamoDB +3 · adv · 30 min
  42. Design an Online Auction (eBay / Catawiki)
      Design an online auction for 100M active listings and 10M concurrent bidders: Redis Lua CAS for atomic bids at 1M/sec peak, proxy bidding, sniping extensions, a…
      Redis, Kafka, WebSockets +2 · adv · 30 min
  43. Design a Multi-Tenant SaaS Platform
      Design a multi-tenant SaaS platform serving 50K tenants with per-tenant SLA tiers, metered billing, noisy-neighbor containment, and zero cross-tenant data leaka…
      PostgreSQL, Redis, Kafka +3 · adv · 30 min
  44. Design a Video Conferencing System (Zoom / Google Meet)
      Design a video conferencing platform for 500K simultaneous meetings and 10M concurrent participants with <150 ms audio and <500 ms video: SFU vs MCU vs P2P, sim…
      Redis, PostgreSQL, S3 +2 · adv · 30 min
  45. Design an Email Service at Gmail Scale (1.8B Users, 300B Messages/Day)
      Design a global email service for 1.8B users and 300B messages/day: SMTP ingress, spam pipeline cascade, per-user sharded search, RFC 5322 threading, and exabyt…
      Bigtable, Kafka, Elasticsearch +3 · adv · 30 min
  46. Design Live Comments at Scale (FB Live / YouTube Live / Twitch Chat)
      Design a live-comment system for 10M concurrent viewers and 100K commenters on one stream: delta-batched fan-out, pre-publish moderation, and the celebrity-stre…
      Redis, Kafka, WebSockets · adv · 30 min
  47. Design a Fraud Detection System (Stripe Radar / PayPal / Feedzai)
      Design a real-time fraud detection service scoring millions of events per second under 100 ms p99 with a rules-ML cascade, online/offline feature store, graph r…
      Kafka, Flink, Redis +3 · adv · 30 min
  48. Design a Fitness Tracking Service (Strava / MapMyRun)
      Design a fitness tracking service for 195M+ users: GPS ingestion, two-stage segment matching with H3 pre-filter and DTW, Kafka-backed leaderboards, and privacy-…
      Redis, Kafka, Spark +3 · adv · 25 min
  49. Design an Online Judge (LeetCode / Codeforces / HackerEarth)
      Design an online judge for 1M users at 100K submissions/hour peak with <15s verdict p99: Firecracker sandboxes, priority queueing for contests, seccomp + cgroup…
      PostgreSQL, Redis, RabbitMQ +1 · adv · 30 min
  50. Design a Price Tracking Service (CamelCamelCamel / Honey / Keepa)
      Design a price-tracking service that watches 100M product URLs with priority-driven scraping, diff-based alerting to 10M subscribers, 2-year historical retentio…
      Kafka, Redis, PostgreSQL +4 · int · 30 min
  51. Design an API Gateway at Scale (Kong / AWS API Gateway / Apigee / Envoy)
      Design an API gateway that handles 100K RPS per instance with <5 ms p99 overhead across 10K upstream services: routing trie, local+global rate limiting, mTLS te…
      Envoy, Kong, Nginx +5 · adv · 30 min
  52. Design a CI/CD Platform (GitHub Actions / GitLab CI / CircleCI)
      Design a CI/CD platform for 100K orgs and 10M workflow runs/day: YAML DAG execution, ephemeral runner pools (Firecracker), content-addressed artifacts, dependen…
      Kafka, S3, PostgreSQL +1 · adv · 30 min
  53. Design an Observability Platform (Datadog / New Relic / Honeycomb)
      Design a unified observability platform for 10M hosts and 1B events/sec: OTLP ingestion across metrics, logs, and traces, cardinality control, trace-log correla…
      Prometheus, Grafana, OpenTelemetry +5 · adv · 30 min
  54. Design a Search Engine (Google-Scale / Brave Search)
      Design a web-scale search engine over 10B documents serving 100K queries/sec at <200 ms p99: sharded inverted index, BM25 + PageRank + neural re-ranking cascade…
      Elasticsearch, Kafka, Bigtable +3 · adv · 30 min
  55. Design a Brokerage Platform (Robinhood / E*TRADE / Interactive Brokers)
      Design a retail brokerage for 30M users: order routing, symbol-channel quote fanout, fractional-share aggregation, tax-lot accounting, and seven-year regulatory…
      PostgreSQL, Kafka, Redis +4 · adv · 30 min
  56. Design Channel-Scale Chat (Discord / Slack)
      Design channel-scale chat for 100K+ member channels with pub/sub fanout, RBAC on the hot path, presence aggregation, and workspace search.
      ScyllaDB, Cassandra, Redis +6 · adv · 30 min
Part 9

AI & ML System Design

LLM serving, RAG, agents, multi-agent orchestration, evaluation, cost, safety, ML fundamentals, feature stores, recommendations, multimodal, voice.

15 chapters · 7h · Intermediate to Advanced · Part overview →
  1. LLM Serving Architecture (vLLM, TGI, TensorRT-LLM) Design a production LLM inference stack: continuous batching, paged attention, KV-cache management, and multi-tenant GPU scheduling. adv 25 min
  2. RAG Pipelines (Retrieval-Augmented Generation) Design production RAG: chunking, embedding models, hybrid dense-plus-sparse retrieval, reranking, and the eval loops that keep it honest. int 25 min
  3. Vector Search at Scale (HNSW, IVF-PQ, DiskANN) Design billion-scale vector search: HNSW, IVF-PQ, and DiskANN indexes, product quantization, hybrid BM25-vector search, and sharding strategies. adv 25 min
  4. AI Agent Architectures (ReAct, Reflection, Planning, Tool Use, Memory) The canonical patterns for turning an LLM into an agent: ReAct's think-act-observe loop, reflection and self-critique, planner-executor decomposition, tool use … adv 25 min
  5. Multi-Agent Orchestration (LangGraph, OpenAI Agents SDK, AutoGen, Swarm) Composing multiple agents into a reliable system: orchestrator-worker topologies, handoffs and delegation, shared memory, parallel fan-out, and the failure mode… adv 25 min
  6. LLM Evaluation and Observability (Ragas, LangSmith, TruLens, LLM-as-Judge) How to evaluate LLM systems before and after they ship: golden datasets, reference-free metrics, LLM-as-judge, continuous eval pipelines, and the observability … adv 25 min
  7. LLMOps and Prompt Engineering (Versioning, Guardrails, Red-Teaming) The operational side of shipping LLM features: prompt-as-code, versioning, rollback, A/B testing prompts, structured outputs, and red-teaming before launch. int 30 min
  8. LLM Cost Optimisation (Semantic Cache, Model Routing, Cascading, Prompt Caching) The cost-engineering toolbox for production LLMs: semantic caching, model routing, cascade small-then-big, prompt caching (Anthropic, OpenAI), and the unit econ… int 30 min
  9. LLM Safety and Guardrails (OWASP LLM Top 10, Prompt Injection, PII, Jailbreaks) The safety-engineering surface for LLM applications: OWASP LLM Top 10, prompt-injection defence, PII redaction, jailbreak containment, and the defence-in-depth … adv 25 min
  10. ML System Design Fundamentals The classic ML systems backbone every modern AI product sits on: candidate generation, ranking, two-tower embeddings, offline/online feature parity, and the tra… int 25 min
  11. Feature Stores and Model Serving (Feast, Tecton, KServe, BentoML, MLflow) The infrastructure that makes ML shippable: online and offline feature stores, the model registry, model servers, shadow deploys, and the production lifecycle a… adv 30 min
  12. Recommendation Systems Deep Dive (DLRM, Two-Tower, Embedding Retrieval, Cold Start) How modern recommenders actually work end-to-end: candidate gen via ANN on embeddings, DLRM-style ranking, exploration-exploitation, cold-start handling, and th… adv 25 min
  13. Realtime AI and Voice Agents (Streaming Inference, WebRTC, LiveKit, Deepgram) Designing sub-second voice agents: streaming ASR, low-latency LLM inference, streaming TTS, WebRTC transport, interruption handling, and the end-to-end latency … adv 25 min
  14. Multimodal AI Systems (CLIP, Whisper, LayoutLM, Document AI) Designing systems that ingest images, audio, video, and documents: CLIP-style embeddings for cross-modal retrieval, Whisper pipelines, OCR-plus-layout models, a… int 25 min
  15. Data Infrastructure for AI (Embedding Pipelines, Chunking, Unstructured ETL, MCP) The data plane that feeds AI systems: source connectors, chunking strategies, embedding at scale, metadata schema, freshness, and the Model Context Protocol as … int 30 min
Part 10

Emerging Patterns

Green computing and forward-looking topics that have not yet settled into a canonical home. Slim by design: new primitives land here first, then graduate into the relevant Part once they mature.

1 chapter · 1h · Intermediate to Advanced · Part overview →
  1. Green Computing (Carbon-Aware Scheduling, PUE, Sustainable Systems) Design systems that account for carbon: carbon-aware scheduling, PUE, renewable-energy datacenters, and the Green Software Foundation practices. int 25 min
Part 11

Interview Framework

RESHADED, diagramming, trade-off articulation, company-specific flavours.

6 chapters · 2h · Intermediate · Part overview →
  1. Interview Frameworks Compared (RESHADED, PEDALS, ADEPT) Compare the major system-design interview frameworks and pick the one you will use in every interview. int 25 min
  2. Requirements Scoping: Functional, Non-Functional, and MoSCoW Master the first five minutes of a system design interview: functional vs non-functional requirements, MoSCoW prioritization, and time-boxing the scope. int 30 min
  3. Diagramming Skills for System Design Interviews Build whiteboard and virtual diagrams that interviewers can read: consistent notation, flow direction, and the tools (Excalidraw, Miro) that the industry prefer… int 25 min
  4. Trade-off Articulation: Saying 'It Depends' Well Learn to verbalize design trade-offs the way senior engineers do: name the axis, state the dependency, and commit to a choice. int 20 min
  5. Company-Specific Interview Flavors (Amazon, Google, Meta, Netflix) How the same system design question looks different at Amazon, Google, Meta, and Netflix: leadership principles, scale emphasis, product sense, and simplicity. int 25 min
  6. Design Doc Authoring: RFCs, ADRs, and the Staff Engineer's Written Output How to write ADRs, design docs, and RFCs that drive alignment, record decisions, and demonstrate Staff+ engineering judgment. adv 25 min