Part 9 of 11

AI & ML System Design

LLM serving, RAG, agents, multi-agent orchestration, evaluation, cost, safety, ML fundamentals, feature stores, recommendations, multimodal, voice.

Modules: 15
Hours: 7
Difficulty: Intermediate to Advanced
  1. 9.0 advanced

    LLM Serving Architecture (vLLM, TGI, TensorRT-LLM)

    Design a production LLM inference stack: continuous batching, paged attention, KV-cache management, and multi-tenant GPU scheduling.

    25 min
  2. 9.1 intermediate

    RAG Pipelines (Retrieval-Augmented Generation)

    Design production RAG: chunking, embedding models, hybrid dense-plus-sparse retrieval, reranking, and the eval loops that keep it honest.

    25 min
  3. 9.2 advanced

    Vector Search at Scale (HNSW, IVF-PQ, DiskANN)

    Design billion-scale vector search: HNSW, IVF-PQ, and DiskANN indexes, product quantization, hybrid BM25-vector search, and sharding strategies.

    25 min
  4. 9.3 advanced

    AI Agent Architectures (ReAct, Reflection, Planning, Tool Use, Memory)

    The canonical patterns for turning an LLM into an agent: ReAct's think-act-observe loop, reflection and self-critique, planner-executor decomposition, tool use and function calling, and how agents manage short- and long-term memory.

    25 min
  5. 9.4 advanced

    Multi-Agent Orchestration (LangGraph, OpenAI Agents SDK, AutoGen, Swarm)

    Composing multiple agents into a reliable system: orchestrator-worker topologies, handoffs and delegation, shared memory, parallel fan-out, and the failure modes of agent graphs.

    25 min
  6. 9.5 advanced

    LLM Evaluation and Observability (Ragas, LangSmith, TruLens, LLM-as-Judge)

    How to evaluate LLM systems before and after they ship: golden datasets, reference-free metrics, LLM-as-judge, continuous eval pipelines, and the observability stack for production LLMs.

    25 min
  7. 9.6 intermediate

    LLMOps and Prompt Engineering (Versioning, Guardrails, Red-Teaming)

    The operational side of shipping LLM features: prompt-as-code, versioning, rollback, A/B testing prompts, structured outputs, and red-teaming before launch.

    30 min
  8. 9.7 intermediate

    LLM Cost Optimisation (Semantic Cache, Model Routing, Cascading, Prompt Caching)

    The cost-engineering toolbox for production LLMs: semantic caching, model routing, small-then-big model cascades, prompt caching (Anthropic, OpenAI), and the unit economics that decide per-request margin.

    30 min
  9. 9.8 advanced

    LLM Safety and Guardrails (OWASP LLM Top 10, Prompt Injection, PII, Jailbreaks)

    The safety-engineering surface for LLM applications: OWASP LLM Top 10, prompt-injection defence, PII redaction, jailbreak containment, and the defence-in-depth model for public-facing agents.

    25 min
  10. 9.9 intermediate

    ML System Design Fundamentals

    The classic ML systems backbone every modern AI product sits on: candidate generation, ranking, two-tower embeddings, offline/online feature parity, and the training-serving skew problem.

    25 min
  11. 9.10 advanced

    Feature Stores and Model Serving (Feast, Tecton, KServe, BentoML, MLflow)

    The infrastructure that makes ML shippable: online and offline feature stores, the model registry, model servers, shadow deploys, and the production lifecycle around a trained model.

    30 min
  12. 9.11 advanced

    Recommendation Systems Deep Dive (DLRM, Two-Tower, Embedding Retrieval, Cold Start)

    How modern recommenders work end-to-end: candidate generation via ANN search over embeddings, DLRM-style ranking, exploration-exploitation, cold-start handling, and the evaluation loop that keeps metrics honest.

    25 min
  13. 9.12 advanced

    Realtime AI and Voice Agents (Streaming Inference, WebRTC, LiveKit, Deepgram)

    Designing sub-second voice agents: streaming ASR, low-latency LLM inference, streaming TTS, WebRTC transport, interruption handling, and the end-to-end latency budget.

    25 min
  14. 9.13 intermediate

    Multimodal AI Systems (CLIP, Whisper, LayoutLM, Document AI)

    Designing systems that ingest images, audio, video, and documents: CLIP-style embeddings for cross-modal retrieval, Whisper pipelines, OCR-plus-layout models, and the storage architecture for unstructured data.

    25 min
  15. 9.14 intermediate

    Data Infrastructure for AI (Embedding Pipelines, Chunking, Unstructured ETL, MCP)

    The data plane that feeds AI systems: source connectors, chunking strategies, embedding at scale, metadata schema, freshness, and the Model Context Protocol as a standard interface.

    30 min