Part 6: Reliability and Operations — The HLD Handbook

6.0 advanced
Observability: Metrics, Logs, Traces, and the OpenTelemetry Standard
The three pillars of observability, USE vs RED methods, and how OpenTelemetry, Prometheus, Grafana, and Jaeger fit together in production.
25 min Prometheus, Grafana, Jaeger +1
6.1 intermediate
SLI, SLO, SLA, and Error Budgets: Making Reliability Quantitative
The Google SRE framework for reliability: what to measure, what to target, what to promise, and how error budgets fund feature work vs reliability work.
25 min Prometheus, Grafana
6.2 intermediate
Resilience Patterns: Timeouts, Retries, Circuit Breakers, and Bulkheads
The defensive patterns that keep distributed systems from cascading into total failure, from Hystrix to modern service mesh implementations.
25 min Envoy, Istio
6.3 intermediate
Graceful Degradation: When Partial Service Beats No Service
Load shedding, feature flags, cached fallbacks, and the product-engineering decisions behind degrading one feature to save the system.
25 min Cloudflare
6.4 intermediate
Auto-Scaling and Capacity Planning: From HPA to Predictive Scaling
Horizontal pod autoscalers, cluster autoscalers, predictive scaling, and the capacity planning math that keeps systems sized right without overspending.
25 min Kafka, Prometheus, Cloudflare
6.5 intermediate
Deployment Strategies: Blue-Green, Canary, Rolling, and Feature Flags
How to ship changes safely with blue-green, rolling, canary, and progressive delivery, plus the role of feature flags and LaunchDarkly-style tooling.
25 min
6.6 advanced
Chaos Engineering: Breaking Things on Purpose
Netflix's Chaos Monkey, the Principles of Chaos, and how to run game days and fault injection experiments without making your on-call call in sick.
25 min Istio
6.7 intermediate
Incident Management: From Detection to Blameless Postmortem
On-call, incident command, severity levels, communication, and how to run postmortems that actually change systems instead of blaming people.
25 min Cloudflare
6.8 intermediate
Health Checks and Readiness: Telling the Truth About Whether You're Up
Liveness, readiness, startup probes, deep vs shallow health checks, and why bad health checks cause more outages than bad code.
20 min Envoy, Istio, Consul
6.9 intermediate
Cost Optimization and FinOps
Apply FinOps to reduce cloud bills without sacrificing reliability: spot instances, reserved capacity, autoscaling, storage tiering, and unit-economics thinking.
25 min
6.10 advanced
Platform Engineering: IDPs, Golden Paths, and DX
Treat the platform as a product: build internal developer platforms with Backstage, golden paths, and DORA/SPACE metrics that move developer productivity.
25 min