Part 6 of 11

Reliability and Operations

Observability, SLOs, chaos, deployment strategies.

Modules: 11
Hours: 5
Difficulty: Intermediate to Advanced
  1. 6.0 · advanced

    Observability: Metrics, Logs, Traces, and the OpenTelemetry Standard

    The three pillars of observability, USE vs RED methods, and how OpenTelemetry, Prometheus, Grafana, and Jaeger fit together in production.

    25 min · Prometheus, Grafana, Jaeger, +1
  2. 6.1 · intermediate

    SLI, SLO, SLA, and Error Budgets: Making Reliability Quantitative

    The Google SRE framework for reliability: what to measure, what to target, what to promise, and how error budgets decide when to spend effort on feature work and when on reliability work.

    25 min · Prometheus, Grafana
  3. 6.2 · intermediate

    Resilience Patterns: Timeouts, Retries, Circuit Breakers, and Bulkheads

    The defensive patterns that keep distributed systems from cascading into total failure, from Hystrix to modern service mesh implementations.

    25 min · Envoy, Istio
  4. 6.3 · intermediate

    Graceful Degradation: When Partial Service Beats No Service

    Load shedding, feature flags, cached fallbacks, and the product-engineering decisions behind degrading one feature to save the system.

    25 min · Cloudflare
  5. 6.4 · intermediate

    Auto-Scaling and Capacity Planning: From HPA to Predictive Scaling

    Horizontal pod autoscalers, cluster autoscalers, predictive scaling, and the capacity planning math that keeps systems sized right without overspending.

    25 min · Kafka, Prometheus, Cloudflare
  6. 6.5 · intermediate

    Deployment Strategies: Blue-Green, Canary, Rolling, and Feature Flags

    How to ship changes safely with blue-green, rolling, canary, and progressive delivery, plus the role of feature flags and LaunchDarkly-style tooling.

    25 min
  7. 6.6 · advanced

    Chaos Engineering: Breaking Things on Purpose

    Netflix's Chaos Monkey, the Principles of Chaos, and how to run game days and fault injection experiments without making your on-call call in sick.

    25 min · Istio
  8. 6.7 · intermediate

    Incident Management: From Detection to Blameless Postmortem

    On-call, incident command, severity levels, communication, and how to run postmortems that actually change systems instead of blaming people.

    25 min · Cloudflare
  9. 6.8 · intermediate

    Health Checks and Readiness: Telling the Truth About Whether You're Up

    Liveness, readiness, startup probes, deep vs shallow health checks, and why bad health checks cause more outages than bad code.

    20 min · Envoy, Istio, Consul
  10. 6.9 · intermediate

    Cost Optimization and FinOps

    Apply FinOps to reduce cloud bills without sacrificing reliability: spot instances, reserved capacity, autoscaling, storage tiering, and unit-economics thinking.

    25 min
  11. 6.10 · advanced

    Platform Engineering: IDPs, Golden Paths, and DX

    Treat the platform as a product: build internal developer platforms with Backstage, golden paths, and DORA/SPACE metrics that measure and help improve developer productivity.

    25 min