159 modules. 2,424 pages. Open.
Every chapter, ordered by part. Full-length teaching articles — no stubs, no "coming soon".
- Modules: 159
- Words: 727,180
- Parts: 12
Prerequisites
Networking, OS, data structures, databases, APIs. The foundation.
- Networking Fundamentals for System Design A practical tour of OSI layers, TCP/UDP, HTTP/1.1 through HTTP/3, DNS, TLS 1.3, and realtime transports like WebSockets and SSE. beg 25 min
- Operating System Essentials for System Design Processes, threads, memory hierarchy with real latency numbers, I/O models (epoll, kqueue, io_uring), and file system internals you need to design systems. beg 20 min
- Data Structures for Distributed Systems The handful of data structures that power modern infrastructure: hash tables, B-trees, LSM-trees, Bloom filters, skip lists, and consistent hashing rings. beg 25 min
- Database Fundamentals for System Design SQL, ACID, indexing, query execution, and the normalization vs denormalization debate, covered at the depth you need to design real systems. beg 25 min
- API Design Basics: REST, GraphQL, gRPC, and the Hard Parts Resource modeling, GraphQL's N+1 problem, gRPC streaming, versioning strategies, idempotency keys, and cursor pagination done right. beg 25 min
Core Fundamentals
Scalability, CAP, estimation, interview framework. The vocabulary.
- Scalability: Growing a System Without Breaking It Vertical vs horizontal scaling, stateless services, read vs write scaling, and when scaling is the wrong answer. The vocabulary every system design conversation… beg 25 min
- Latency and Throughput: The Two Numbers That Matter Latency vs throughput, Jeff Dean's numbers, tail latency (p50/p95/p99), Little's Law, and how to find the real bottleneck in any system. beg 30 min
- Availability and Reliability: Nines, SLOs, and Staying Up Availability math, MTBF/MTTR, redundancy patterns, SLI/SLO/SLA, error budgets, dependency math, and real-world outage lessons. beg 25 min
- Consistency Models: What Readers Actually See Strong, eventual, and causal consistency. Read-your-writes, monotonic reads, and the client-centric vs data-centric distinction that makes consistency tractable… int 25 min
- Back-of-the-Envelope Estimation Powers of 2 and 10, storage and bandwidth templates, QPS math, and worked examples for Twitter and YouTube scale. The interview skill that matters in production… beg 25 min
- How to Approach a System Design Question A repeatable 6-step framework for 45-minute system design interviews: clarify, estimate, API, high-level design, deep dive, trade-offs. Minute-by-minute plan wi… beg 25 min
- Trade-off Thinking Every design decision is a trade-off. How to articulate, structure, and defend your choices, and how to recognize when not choosing is the right move. int 25 min
Building Blocks
Load balancers, caches, queues, databases, rate limiters.
- Load Balancers: Spreading Traffic, Absorbing Failure L4 vs L7 load balancing, algorithms, health checks, and how Envoy, HAProxy, NGINX, and AWS ALB/NLB actually work in production. int 25 min
- Reverse Proxies and API Gateways: The Smart Edge TLS termination, routing, auth, rate limiting, and why an API gateway is more than a smarter load balancer. int 30 min
- Content Delivery Networks: Moving Bytes Closer to Users How CDNs work, edge caching, cache keys and invalidation, and how Cloudflare, Akamai, and Fastly differ in practice. int 30 min
- Caching: From Browser to Database Cache hierarchy, five write patterns, eviction policies (LRU/TinyLFU), stampede prevention, and why invalidation is the hardest problem in distributed systems. int 25 min
- SQL Databases: The Boring Technology That Wins B-trees, MVCC, isolation levels, indexing, and why PostgreSQL and MySQL are still the right answer for most systems. int 25 min
- NoSQL Databases: Picking the Right Non-Relational Tool Key-value, document, wide-column, and graph stores, and how DynamoDB, Cassandra, MongoDB, and Neo4j differ in practice. int 25 min
- Database Partitioning and Sharding: When One Node Is Not Enough Range, hash, and consistent-hash partitioning, hot spots, resharding, and how Notion, Figma, and Discord partition data in practice. adv 25 min
- Database Replication: Keeping Copies in Sync Leader-follower, multi-leader, and leaderless replication, sync vs async, failover, and how Postgres, MySQL, Cassandra, and DynamoDB differ. adv 25 min
- Message Queues and Streaming: Decoupling at Scale Queues vs logs, Kafka vs RabbitMQ vs SQS, delivery semantics, partitioning, consumer groups, and when streaming beats request-response. adv 25 min
- Pub/Sub: Fan-Out and Event-Driven Systems Topics, subscriptions, fan-out strategies, and how Google Pub/Sub, Redis, NATS, and SNS+SQS implement publish/subscribe at scale. int 25 min
- Real-Time Communication: WebSockets, SSE, and Long Polling Long polling, Server-Sent Events, WebSockets, WebRTC, and MQTT: when to use each, how to scale persistent connections, and production lessons from Discord and S… int 25 min
- Rate Limiting: Protecting Systems from Themselves Token bucket, leaky bucket, fixed and sliding windows, distributed rate limiting, and how Stripe, Cloudflare, and GitHub protect their APIs. int 25 min
- Service Discovery and Service Mesh: Finding and Talking to Services DNS, client-side and server-side discovery, health checks, sidecars, mTLS, and what Consul, Envoy, Istio, and Linkerd actually solve. adv 25 min
- Blob and Object Storage: Storing the Big Stuff S3 semantics, object storage internals, multipart uploads, lifecycle policies, and when to pick S3 vs GCS vs MinIO vs a filesystem. int 25 min
- Geospatial Indexing: Geohash, Quadtree, R-tree, S2, and H3 Space-filling curves, hierarchical grids, and tree indexes for location queries - with decision guidance on geohash vs quadtree vs R-tree vs S2 vs H3. int 25 min
- Edge Computing (Cloudflare Workers, Lambda@Edge, Deno Deploy) Design applications for the edge: cold starts, state replication with Durable Objects, edge databases, and the limits of running code close to users. int 25 min
Distributed Systems Theory
Consensus, CRDTs, clocks, consistent hashing. The theory.
- Consensus Protocols: How Distributed Systems Agree Raft and Multi-Paxos explained: leader election, log replication, safety under term numbers, and why etcd, Consul, and CockroachDB picked Raft. int 30 min
- Consistency Deep Dive: Linearizability, Serializability, and the Spectrum Between Linearizability vs serializability vs causal vs eventual, external consistency, and how to reason precisely about what your database actually gives you. adv 25 min
- Quorums and Replication: The Math of R + W > N Read and write quorums, Dynamo-style replication, sloppy quorums, hinted handoff, and when quorums give you linearizability (and when they do not). int 25 min
- CAP and PACELC: The Tradeoff That Keeps Confusing People What CAP actually says (and what it doesn't), the three big misconceptions, and how PACELC fixes the omission of latency. int 25 min
- Clocks and Ordering: Lamport, Vector, and Hybrid Logical Clocks Why wall-clock time lies, happens-before, Lamport clocks, vector clocks, hybrid logical clocks, and Google TrueTime. adv 30 min
- CRDTs: Conflict-Free Replicated Data Types State-based (CvRDT) and op-based (CmRDT) CRDTs, G-Counter, PN-Counter, OR-Set, LWW-Register, and how Yjs, Automerge, Figma, and Redis use them. adv 25 min
- Distributed Transactions: 2PC, Saga, and When to Avoid Both Two-phase commit, Percolator, Sagas, outbox pattern, and the honest answer to distributed transactions: don't, or pay the price. adv 30 min
- Idempotency and Exactly-Once: The Honest Truth About Delivery Guarantees Why exactly-once delivery is a myth, how idempotency keys make at-least-once feel exactly-once, and how Stripe and Kafka implement it. int 25 min
- Failure Detection: Deciding a Node Is Dead Heartbeats, phi-accrual detectors, gossip (SWIM), and why 'is it dead or just slow?' has no correct answer. int 25 min
- Consistent Hashing: Keys to Nodes Without Global Reshuffles The hash ring, virtual nodes, bounded-load variant, rendezvous hashing, Maglev, and jump hash for distributing keys across dynamic node sets. int 25 min
- Merkle Trees and Anti-Entropy: Keeping Replicas in Sync Cheaply Merkle trees, anti-entropy protocols, read repair, hinted handoff, and how Dynamo, Cassandra, and Git use hashed trees to find the differences. adv 25 min
Data Systems
Storage engines, OLAP, streams, search, vectors.
- Storage Engines: B-Trees, LSM-Trees, and Why Your Database Feels the Way It Does How B-tree and LSM-tree storage engines shape read, write, and space amplification, with examples from InnoDB, PostgreSQL, RocksDB, and Cassandra. adv 25 min
- OLTP vs OLAP: Row Stores, Column Stores, and Matching Shape to Workload Why transactional systems use row-oriented storage and analytical systems use columnar, with examples from Postgres, MySQL, Redshift, BigQuery, ClickHouse, and … int 25 min
- Data Warehouses and Data Lakes: Structure, Schema, and the Lakehouse How Redshift, BigQuery, Snowflake, S3-based lakes, and the lakehouse pattern with Delta Lake, Iceberg, and Hudi actually fit together. int 25 min
- Stream vs Batch Processing: Lambda, Kappa, and the End of That Debate Batch with Spark and Hadoop, streaming with Kafka Streams, Flink, and Spark Streaming, and how Lambda and Kappa architectures stack up. int 25 min
- Change Data Capture: Streaming the Database's Inner Monologue How Debezium, Maxwell, and the outbox pattern turn WAL and binlog entries into reliable event streams, and when each approach is the right call. int 25 min
- Search Systems: Inverted Indexes, BM25, and Running Elasticsearch in Production How Elasticsearch, OpenSearch, and Solr build inverted indexes, score with BM25, and handle faceting, relevance tuning, and sharding at scale. int 30 min
- Time-Series Databases: Metrics, Events, and Retention at Scale How Prometheus, InfluxDB, TimescaleDB, and VictoriaMetrics handle write-heavy time-series workloads with downsampling and retention policies. int 25 min
- Graph Databases: Property Graphs, Cypher, and When Joins Are the Problem How Neo4j, Amazon Neptune, and Dgraph model relationships, and when graph queries beat recursive SQL joins. int 25 min
- Vector Databases: Embeddings, ANN Indexes, and the Retrieval Layer for AI How Pinecone, Weaviate, Milvus, and pgvector store and search embeddings using HNSW and IVF approximate nearest neighbor indexes. adv 25 min
- Key-Value Stores: Redis, Memcached, DynamoDB, and Picking the Right Hash Table How Redis, Memcached, and DynamoDB differ in durability, data model, and scaling, and when each is the right key-value store. int 25 min
Architecture Patterns
Microservices, event-driven, CQRS, multi-region.
- Monolith vs Microservices: Team Topology, Conway's Law, and the Distributed System Tax When a modular monolith beats microservices, how Conway's Law shapes architecture, and what the distributed system tax actually costs you. int 25 min
- Event-Driven Architecture: Notifications, State Transfer, and Choreography The three flavors of events, how Kafka and event buses enable loose coupling, and when choreography beats orchestration. int 25 min
- CQRS: Separating Reads from Writes Without Losing Your Mind Command Query Responsibility Segregation in practice: when to split read and write models, how to handle eventual consistency, and when CQRS is overkill. int 20 min
- Event Sourcing: Events as the Source of Truth Storing state as an append-only log of events, with replay, projections, snapshots, and the ops reality of running event-sourced systems. adv 25 min
- Serverless: Functions, Cold Starts, and When FaaS Actually Saves Money AWS Lambda, Google Cloud Functions, and Azure Functions in practice: cold starts, concurrency models, and the honest economics of serverless. int 20 min
- Backend for Frontend: Per-Client API Aggregation Done Right When one API cannot serve web, mobile, and partner clients well, the BFF pattern gives each client its own aggregation layer. int 25 min
- Strangler Fig: Incremental Migration Without a Big Bang Martin Fowler's strangler fig pattern for replacing legacy systems incrementally, with routing, facades, and how teams actually execute multi-year migrations. int 30 min
- Hexagonal and Clean Architecture: Keeping Business Logic Independent Ports and adapters, clean architecture, and onion architecture: how to keep domain logic testable and framework-independent. int 20 min
- Multi-Region Architecture: Active-Passive, Active-Active, and CRDTs Designing systems that survive regional failure: DNS failover, active-passive replication, active-active with CRDTs, and Cloudflare's model. adv 25 min
- Multi-Tenancy: Silo, Pool, and the SaaS Isolation Spectrum Designing SaaS platforms that host many tenants on shared infrastructure: isolation levels, noisy-neighbor defenses, per-tenant metering, and when to graduate a… int 25 min
- CRDT Applications (Yjs, Automerge, Local-First Software) Design local-first collaborative software with CRDTs: Yjs, Automerge, peer-to-peer sync, and the architectural shift away from authoritative central servers. adv 25 min
Reliability and Operations
Observability, SLOs, chaos, deployment strategies.
- Observability: Metrics, Logs, Traces, and the OpenTelemetry Standard The three pillars of observability, USE vs RED methods, and how OpenTelemetry, Prometheus, Grafana, and Jaeger fit together in production. adv 25 min
- SLI, SLO, SLA, and Error Budgets: Making Reliability Quantitative The Google SRE framework for reliability: what to measure, what to target, what to promise, and how error budgets fund feature work vs reliability work. int 25 min
- Resilience Patterns: Timeouts, Retries, Circuit Breakers, and Bulkheads The defensive patterns that keep distributed systems from cascading into total failure, from Hystrix to modern service mesh implementations. int 25 min
- Graceful Degradation: When Partial Service Beats No Service Load shedding, feature flags, cached fallbacks, and the product-engineering decisions behind degrading one feature to save the system. int 25 min
- Auto-Scaling and Capacity Planning: From HPA to Predictive Scaling Horizontal pod autoscalers, cluster autoscalers, predictive scaling, and the capacity planning math that keeps systems sized right without overspending. int 25 min
- Deployment Strategies: Blue-Green, Canary, Rolling, and Feature Flags How to ship changes safely with blue-green, rolling, canary, and progressive delivery, plus the role of feature flags and LaunchDarkly-style tooling. int 25 min
- Chaos Engineering: Breaking Things on Purpose Netflix's Chaos Monkey, the Principles of Chaos, and how to run game days and fault injection experiments without making your on-call call in sick. adv 25 min
- Incident Management: From Detection to Blameless Postmortem On-call, incident command, severity levels, communication, and how to run postmortems that actually change systems instead of blaming people. int 25 min
- Health Checks and Readiness: Telling the Truth About Whether You're Up Liveness, readiness, startup probes, deep vs shallow health checks, and why bad health checks cause more outages than bad code. int 20 min
- Cost Optimization and FinOps Apply FinOps to reduce cloud bills without sacrificing reliability: spot instances, reserved capacity, autoscaling, storage tiering, and unit-economics thinking… int 25 min
- Platform Engineering: IDPs, Golden Paths, and DX Treat the platform as a product: build internal developer platforms with Backstage, golden paths, and DORA/SPACE metrics that move developer productivity. adv 25 min
Security at Scale
OAuth2, JWT, mTLS, DDoS.
- Authentication vs Authorization: Identity, Permissions, and Access Models AuthN vs AuthZ, session vs token auth, and access control models: RBAC, ABAC, ReBAC with examples from SpiceDB, OpenFGA, and AWS IAM. int 25 min
- OAuth 2.0 and OpenID Connect: Delegated Authorization and Identity Done Right The OAuth 2.0 authorization framework, OIDC identity layer, and the Authorization Code + PKCE flow that is the modern standard for web and mobile. adv 25 min
- JWT Deep Dive: Signed Tokens, Claims, and the Revocation Problem How JSON Web Tokens work: JWS signing, JWE encryption, claim validation, key rotation, and the trade-offs of stateless auth. int 25 min
- mTLS and Service-to-Service Authentication: SPIFFE, Service Mesh, and Zero Trust How mutual TLS, SPIFFE/SPIRE, and service meshes like Istio and Linkerd authenticate services without long-lived credentials. adv 25 min
- Secrets Management: Vault, KMS, and the End of Secrets in Config Files Managing API keys, passwords, and certificates with Vault, AWS Secrets Manager, KMS envelope encryption, and dynamic secrets. int 25 min
- DDoS Protection and WAFs: Mitigating Volumetric and Application Attacks Defending against L3/L4 and L7 DDoS with Cloudflare, AWS Shield, and WAFs; rate limiting, bot management, and the OWASP Top 10. int 25 min
- Data Residency and Compliance Architecture (GDPR, DPDP, CCPA, Right-to-Erasure) Designing multi-jurisdictional systems for GDPR, DPDP, CCPA, and LGPD with data classification, regional silos, crypto-shredding, and auditable erasure. adv 30 min
- Supply Chain Security: SBOM, SLSA, Sigstore, and Defending Against xz-utils Protecting the software supply chain with SBOMs, SLSA provenance, Sigstore signing, admission policies, and lessons from xz-utils and SolarWinds. adv 25 min
- Privacy-Preserving Systems (Differential Privacy, Federated Learning) Design systems that protect user data by construction: differential privacy, federated learning, secure aggregation, and an introduction to homomorphic encrypti… adv 30 min
- Post-Quantum Cryptography: Migrating to ML-KEM, ML-DSA, and a Crypto-Agile Future Why harvest-now-decrypt-later makes PQC urgent, what NIST standardized in 2024, and how to migrate production TLS and long-lived secrets to hybrid post-quantum … adv 25 min
Case Studies
56 end-to-end system designs.
- Design a URL Shortener (TinyURL / bit.ly) An interview-grade walkthrough for a URL shortener: capacity estimation, short-code generation, hot-key caching, and an analytics pipeline that never blocks the… int 30 min
- Design a Pastebin (Paste Sharing Service) An interview-grade walkthrough for a Pastebin-style text sharing service: object storage split, TTL-based expiration pipelines, syntax highlighting placement, a… int 30 min
- Design a Distributed Rate Limiter An interview-grade walkthrough for a distributed API rate limiter: algorithm choice, Redis Lua atomicity, two-tier local+global synchronization, and fail-open f… int 30 min
- Design a Distributed Key-Value Store (Dynamo / Cassandra / Riak) Design a distributed KV store with consistent hashing, quorum replication, gossip membership, hinted handoff, and Merkle-tree anti-entropy repair. adv 30 min
- Design a Notification System (Push, SMS, Email at Scale) An interview-grade walkthrough for a multi-channel notification platform: fan-out architecture, APNs/FCM integration, retry with dead-letter queues, and device-… int 35 min
- Design a Chat System (WhatsApp / Messenger / Signal) Staff-level design for 1:1 and small-group chat at WhatsApp scale: 500M concurrent connections, message ordering, E2E encryption, and storage model trade-offs. adv 35 min
- Design a Social Media Feed (Twitter / Instagram / LinkedIn) Fan-out architecture, hybrid push/pull, ML ranking pipelines, and the celebrity problem at Twitter/X scale. adv 35 min
- Design a Photo Sharing Service (Instagram) Design Instagram-scale photo sharing: upload pipeline, transcoding, multi-resolution image serving, CDN with origin shielding, and news feed integration. int 30 min
- Design a Web Crawler (Googlebot-style) Design a distributed web crawler with URL frontier, politeness policies, content deduplication, robots.txt compliance, and Bloom-filter-backed URL dedup at bill… int 30 min
- Design Search Autocomplete (Typeahead Suggestions) Design a low-latency autocomplete system with tries, top-K precomputation, real-time trending overlays, and multi-tier caching at Google scale. int 30 min
- Design a Video Streaming Service (YouTube / Twitch / TikTok) Design a UGC video platform from upload through adaptive bitrate streaming: transcode pipeline, HLS/DASH/CMAF packaging, CDN delivery with ISP peering, and live… int 35 min
- Design Netflix (End-to-End) A whole-system walkthrough of Netflix's architecture: microservices, Open Connect CDN, per-title encoding, Cassandra + EVCache, resilience patterns, and chaos e… adv 35 min
- Design a Ride-Hailing Service (Uber / Lyft) An interview-grade walkthrough for Uber-scale ride-hailing: H3 geospatial indexing, real-time location ingest, batched bipartite matching, surge pricing, and tr… int 35 min
- Design Google Maps (Routing and Tile Rendering) Design planet-scale mapping: map tile rendering, shortest-path routing with contraction hierarchies, ETA prediction with GNNs, and offline maps. int 35 min
- Design a File Sync Service (Dropbox / Google Drive) Design a Dropbox-style file sync service: block-level deduplication, delta sync, conflict resolution, versioning, and client-server reconciliation. int 30 min
- Design Collaborative Editing (Google Docs / Figma / Notion) Staff-level design for real-time collaborative editing at 100K+ concurrent editors: OT vs CRDTs, presence broadcasting, offline sync, and version history. adv 35 min
- Design a Distributed Cache (Memcached / Redis Cluster) Design a Memcached- or Redis-style distributed cache: consistent hashing, eviction, replication, client-side sharding, and hot-key mitigation. int 30 min
- Design a Recommendation System (Netflix / YouTube / TikTok) Design a two-stage recommendation system: candidate generation, ranking, collaborative filtering, content-based features, a feature store, and cold-start handli… adv 30 min
- Design a Ticketing System (BookMyShow / Ticketmaster) An interview-grade walkthrough for high-concurrency ticketing: seat locking with Redis SETNX, saga-based payment, virtual waiting rooms, and anti-bot defenses a… int 30 min
- Design a Payment System (Stripe / PayPal) Design a payment system with a double-entry ledger, idempotency keys, saga-orchestrated cross-service flows, and the compliance constraints that shape every dec… adv 30 min
- Design a Stock Exchange (Matching Engine) Design a deterministic, low-latency matching engine: FIFO order book, price-time priority, multicast market data distribution, and co-location realities. adv 30 min
- Design a Food Delivery Service (DoorDash / Swiggy) Design a three-sided food delivery marketplace: dispatch with batching and reassignment, composite ETA prediction, and driver-merchant-customer coordination. adv 30 min
- Design a Metrics Pipeline (Prometheus / InfluxDB / Thanos) Design a time-series metrics pipeline: high-cardinality ingestion, aggregation across clusters with Thanos or Cortex, alerting, and downsampling for long-term r… adv 30 min
- Design Ad-Click Aggregation (Real-Time Stream Processing) Design an ad-click aggregation system with exactly-once semantics on Kafka + Flink, real-time fraud detection, and low-latency dashboards. adv 30 min
- Design a Logging Platform (ELK / Loki / Splunk) Design a logging platform: ingestion at scale, index vs. label-based storage (Elastic vs Loki), retention tiering, and full-text search with BM25. int 30 min
- Design a Proximity Service (Nearby Friends / Yelp) Design a proximity service for 100M users with 1M concurrent location-sharing sessions: geohash/H3/S2 trade-offs, Redis geosets, bounding-box query fan-out, pri… adv 30 min
- Design a Real-Time Leaderboard Design a real-time leaderboard for 10M players with 100K score updates/sec, tie-breaking, time-windowed views, friend boards, and approximate rank for the tail. int 30 min
- Design a Unique ID Generator (Snowflake, ULID, TSID, UUIDv7) Design a distributed ID generator producing 10M 64-bit IDs/sec, monotonic-ish ordering, clock-skew resilience, and the four-way trade-off between Snowflake, ULI… adv 30 min
- Design a Hotel Reservation System (Booking.com / Airbnb) Staff-level design for hotel reservation: search/booking split, PostgreSQL exclusion constraints for date-range double-booking prevention, Temporal saga orchest… adv 30 min
- Design a Distributed Job Scheduler (Airflow / Temporal / Distributed Cron) Design a scheduler for 100k registered jobs and 10k executions/sec with exactly-once execution, DAGs up to 10k nodes, late/missed run policies, and graceful sch… adv 30 min
- Design ChatGPT (Conversational AI at Scale) Design ChatGPT for 900M weekly users: multi-tenant LLM serving, session-state architecture, streaming SSE, per-user memory, safety, and multi-region deployment. adv 35 min
- Design an Enterprise RAG System Design a multi-tenant enterprise RAG platform for 1k tenants with 10M documents each at 100 QPS/tenant: ingestion, hybrid retrieval, reranking, citation, access… adv 30 min
- Design a Coding Agent (Claude Code / GitHub Copilot / Cursor) Design a coding agent serving 1M concurrent sessions across autocomplete, chat, and autonomous loop modes with repo indexing, sandboxed tool use, and streaming … adv 30 min
- Design Perplexity (AI Search with Citations) Design an AI search engine for 50M MAU, 5k QPS peak, <2 s answer latency with inline citations: query rewriting, source retrieval, citation-grounding, streaming… adv 30 min
- Design a Voice Agent (Alexa / Siri-Class Realtime) Design a realtime voice agent for 100M devices with 50k concurrent conversations and sub-700 ms turn latency: streaming ASR, LLM turn-taking, streaming TTS, Web… adv 30 min
- Design a Content Moderation System at Scale Design moderation for 500M posts/day with <200 ms pre-publish latency, human-in-loop for 0.5% of traffic, multi-modal (text+image+video): classifier cascade, re… adv 30 min
- Design a Semantic Cache for LLM Applications Design an embedding-similarity cache for LLM prompts at 10k QPS, 70%+ hit rate, <10 ms lookup: similarity threshold calibration, invalidation on source change, … adv 30 min
- Design a Model Router and Gateway (OpenRouter / LiteLLM) Design a gateway routing 20k QPS across 50+ models and 10 providers with <30 ms routing overhead: cost/latency/quality routing strategies, provider failover, st… adv 30 min
- Design a Feature Flag Service (LaunchDarkly / Harness FME / Unleash) Design a feature flag and experimentation platform for 20T evaluations/day with sub-millisecond SDK-side latency, streaming config distribution, and sub-60s kil… adv 30 min
- Design a DNS Service (Cloudflare 1.1.1.1 / Google 8.8.8.8) Design a public recursive DNS resolver serving trillions of queries/day globally with <20 ms p99 from 300+ anycast POPs: UDP/TCP/DoT/DoH/DoQ, DNSSEC validation,… adv 30 min
- Design a Dating App (Tinder / Hinge / Bumble) Design a dating app for 100M MAU handling 1.5B swipes/day with <50 ms card loads: two-tower recommendations, geospatial filtering, mutual-match detection, and s… adv 30 min
- Design an Online Auction (eBay / Catawiki) Design an online auction for 100M active listings and 10M concurrent bidders: Redis Lua CAS for atomic bids at 1M/sec peak, proxy bidding, sniping extensions, a… adv 30 min
- Design a Multi-Tenant SaaS Platform Design a multi-tenant SaaS platform serving 50K tenants with per-tenant SLA tiers, metered billing, noisy-neighbor containment, and zero cross-tenant data leaka… adv 30 min
- Design a Video Conferencing System (Zoom / Google Meet) Design a video conferencing platform for 500K simultaneous meetings and 10M concurrent participants with <150 ms audio and <500 ms video: SFU vs MCU vs P2P, sim… adv 30 min
- Design an Email Service at Gmail Scale (1.8B Users, 300B Messages/Day) Design a global email service for 1.8B users and 300B messages/day: SMTP ingress, spam pipeline cascade, per-user sharded search, RFC 5322 threading, and exabyt… adv 30 min
- Design Live Comments at Scale (FB Live / YouTube Live / Twitch Chat) Design a live-comment system for 10M concurrent viewers and 100K commenters on one stream: delta-batched fan-out, pre-publish moderation, and the celebrity-stre… adv 30 min
- Design a Fraud Detection System (Stripe Radar / PayPal / Feedzai) Design a real-time fraud detection service scoring millions of events per second under 100 ms p99 with a rules-ML cascade, online/offline feature store, graph r… adv 30 min
- Design a Fitness Tracking Service (Strava / MapMyRun) Design a fitness tracking service for 195M+ users: GPS ingestion, two-stage segment matching with H3 pre-filter and DTW, Kafka-backed leaderboards, and privacy-… adv 25 min
- Design an Online Judge (LeetCode / Codeforces / HackerEarth) Design an online judge for 1M users at 100K submissions/hour peak with <15s verdict p99: Firecracker sandboxes, priority queueing for contests, seccomp + cgroup… adv 30 min
- Design a Price Tracking Service (CamelCamelCamel / Honey / Keepa) Design a price-tracking service that watches 100M product URLs with priority-driven scraping, diff-based alerting to 10M subscribers, 2-year historical retentio… int 30 min
- Design an API Gateway at Scale (Kong / AWS API Gateway / Apigee / Envoy) Design an API gateway that handles 100K RPS per instance with <5 ms p99 overhead across 10K upstream services: routing trie, local+global rate limiting, mTLS te… adv 30 min
- Design a CI/CD Platform (GitHub Actions / GitLab CI / CircleCI) Design a CI/CD platform for 100K orgs and 10M workflow runs/day: YAML DAG execution, ephemeral runner pools (Firecracker), content-addressed artifacts, dependen… adv 30 min
- Design an Observability Platform (Datadog / New Relic / Honeycomb) Design a unified observability platform for 10M hosts and 1B events/sec: OTLP ingestion across metrics, logs, and traces, cardinality control, trace-log correla… adv 30 min
- Design a Search Engine (Google-Scale / Brave Search) Design a web-scale search engine over 10B documents serving 100K queries/sec at <200 ms p99: sharded inverted index, BM25 + PageRank + neural re-ranking cascade… adv 30 min
- Design a Brokerage Platform (Robinhood / E*TRADE / Interactive Brokers) Design a retail brokerage for 30M users: order routing, symbol-channel quote fanout, fractional-share aggregation, tax-lot accounting, and seven-year regulatory… adv 30 min
- Design Channel-Scale Chat (Discord / Slack) Design channel-scale chat for 100K+ member channels with pub/sub fanout, RBAC on the hot path, presence aggregation, and workspace search. adv 30 min
AI & ML System Design
LLM serving, RAG, agents, multi-agent orchestration, evaluation, cost, safety, ML fundamentals, feature stores, recommendations, multimodal, voice.
- LLM Serving Architecture (vLLM, TGI, TensorRT-LLM) Design a production LLM inference stack: continuous batching, paged attention, KV-cache management, and multi-tenant GPU scheduling. adv 25 min
- RAG Pipelines (Retrieval-Augmented Generation) Design production RAG: chunking, embedding models, hybrid dense-plus-sparse retrieval, reranking, and the eval loops that keep it honest. int 25 min
- Vector Search at Scale (HNSW, IVF-PQ, DiskANN) Design billion-scale vector search: HNSW, IVF-PQ, and DiskANN indexes, product quantization, hybrid BM25-vector search, and sharding strategies. adv 25 min
- AI Agent Architectures (ReAct, Reflection, Planning, Tool Use, Memory) The canonical patterns for turning an LLM into an agent: ReAct's think-act-observe loop, reflection and self-critique, planner-executor decomposition, tool use … adv 25 min
- Multi-Agent Orchestration (LangGraph, OpenAI Agents SDK, AutoGen, Swarm) Composing multiple agents into a reliable system: orchestrator-worker topologies, handoffs and delegation, shared memory, parallel fan-out, and the failure mode… adv 25 min
- LLM Evaluation and Observability (Ragas, LangSmith, TruLens, LLM-as-Judge) How to evaluate LLM systems before and after they ship: golden datasets, reference-free metrics, LLM-as-judge, continuous eval pipelines, and the observability … adv 25 min
- LLMOps and Prompt Engineering (Versioning, Guardrails, Red-Teaming) The operational side of shipping LLM features: prompt-as-code, versioning, rollback, A/B testing prompts, structured outputs, and red-teaming before launch. int 30 min
- LLM Cost Optimisation (Semantic Cache, Model Routing, Cascading, Prompt Caching) The cost-engineering toolbox for production LLMs: semantic caching, model routing, cascade small-then-big, prompt caching (Anthropic, OpenAI), and the unit econ… int 30 min
- LLM Safety and Guardrails (OWASP LLM Top 10, Prompt Injection, PII, Jailbreaks) The safety-engineering surface for LLM applications: OWASP LLM Top 10, prompt-injection defence, PII redaction, jailbreak containment, and the defence-in-depth … adv 25 min
- ML System Design Fundamentals The classic ML systems backbone every modern AI product sits on: candidate generation, ranking, two-tower embeddings, offline/online feature parity, and the tra… int 25 min
- Feature Stores and Model Serving (Feast, Tecton, KServe, BentoML, MLflow) The infrastructure that makes ML shippable: online and offline feature stores, the model registry, model servers, shadow deploys, and the production lifecycle a… adv 30 min
- Recommendation Systems Deep Dive (DLRM, Two-Tower, Embedding Retrieval, Cold Start) How modern recommenders actually work end-to-end: candidate gen via ANN on embeddings, DLRM-style ranking, exploration-exploitation, cold-start handling, and th… adv 25 min
- Realtime AI and Voice Agents (Streaming Inference, WebRTC, LiveKit, Deepgram) Designing sub-second voice agents: streaming ASR, low-latency LLM inference, streaming TTS, WebRTC transport, interruption handling, and the end-to-end latency … adv 25 min
- Multimodal AI Systems (CLIP, Whisper, LayoutLM, Document AI) Designing systems that ingest images, audio, video, and documents: CLIP-style embeddings for cross-modal retrieval, Whisper pipelines, OCR-plus-layout models, a… int 25 min
- Data Infrastructure for AI (Embedding Pipelines, Chunking, Unstructured ETL, MCP) The data plane that feeds AI systems: source connectors, chunking strategies, embedding at scale, metadata schema, freshness, and the Model Context Protocol as … int 30 min
Emerging Patterns
Green computing and forward-looking topics that have not yet settled into a canonical home. Slim by design: new primitives land here first, then graduate into the relevant Part once they mature.
Interview Framework
RESHADED, diagramming, trade-off articulation, company-specific flavors.
- Interview Frameworks Compared (RESHADED, PEDALS, ADEPT) Compare the major system-design interview frameworks and pick the one you will use in every interview. int 25 min
- Requirements Scoping: Functional, Non-Functional, and MoSCoW Master the first five minutes of a system design interview: functional vs non-functional requirements, MoSCoW prioritization, and time-boxing the scope. int 30 min
- Diagramming Skills for System Design Interviews Build whiteboard and virtual diagrams that interviewers can read: consistent notation, flow direction, and the tools (Excalidraw, Miro) that the industry prefer… int 25 min
- Trade-off Articulation: Saying 'It Depends' Well Learn to verbalize design trade-offs the way senior engineers do: name the axis, state the dependency, and commit to a choice. int 20 min
- Company-Specific Interview Flavors (Amazon, Google, Meta, Netflix) How the same system design question looks different at Amazon, Google, Meta, and Netflix: leadership principles, scale emphasis, product sense, and simplicity. int 25 min
- Design Doc Authoring: RFCs, ADRs, and the Staff Engineer's Written Output How to write ADRs, design docs, and RFCs that drive alignment, record decisions, and demonstrate Staff+ engineering judgment. adv 25 min