Reliability and Operations
Observability, SLOs, chaos, deployment strategies.
- Modules: 11
- Hours: 5
- Difficulty: Intermediate to Advanced
- 6.0 advanced
Observability: Metrics, Logs, Traces, and the OpenTelemetry Standard
The three pillars of observability, USE vs RED methods, and how OpenTelemetry, Prometheus, Grafana, and Jaeger fit together in production.
- 6.1 intermediate
SLI, SLO, SLA, and Error Budgets: Making Reliability Quantitative
The Google SRE framework for reliability: what to measure, what to target, what to promise, and how error budgets decide when to spend effort on feature work and when on reliability work.
- 6.2 intermediate
Resilience Patterns: Timeouts, Retries, Circuit Breakers, and Bulkheads
The defensive patterns that keep distributed systems from cascading into total failure, from Hystrix to modern service mesh implementations.
- 6.3 intermediate
Graceful Degradation: When Partial Service Beats No Service
Load shedding, feature flags, cached fallbacks, and the product-engineering decisions behind degrading one feature to save the system.
- 6.4 intermediate
Auto-Scaling and Capacity Planning: From HPA to Predictive Scaling
Horizontal pod autoscalers, cluster autoscalers, predictive scaling, and the capacity planning math that keeps systems sized right without overspending.
- 6.5 intermediate
Deployment Strategies: Blue-Green, Canary, Rolling, and Feature Flags
How to ship changes safely with blue-green, rolling, canary, and progressive delivery, plus the role of feature flags and LaunchDarkly-style tooling.
- 6.6 advanced
Chaos Engineering: Breaking Things on Purpose
Netflix's Chaos Monkey, the Principles of Chaos Engineering, and how to run game days and fault injection experiments without making your on-call engineer call in sick.
- 6.7 intermediate
Incident Management: From Detection to Blameless Postmortem
On-call, incident command, severity levels, communication, and how to run postmortems that actually change systems instead of blaming people.
- 6.8 intermediate
Health Checks and Readiness: Telling the Truth About Whether You're Up
Liveness, readiness, startup probes, deep vs shallow health checks, and why bad health checks cause more outages than bad code.
- 6.9 intermediate
Cost Optimization and FinOps
Apply FinOps to reduce cloud bills without sacrificing reliability: spot instances, reserved capacity, autoscaling, storage tiering, and unit-economics thinking.
- 6.10 advanced
Platform Engineering: IDPs, Golden Paths, and DX
Treat the platform as a product: build internal developer platforms with Backstage and golden paths, and use DORA and SPACE metrics to measure and improve developer productivity.