Part 9: AI & ML System Design

9.0 advanced
LLM Serving Architecture (vLLM, TGI, TensorRT-LLM)
Design a production LLM inference stack: continuous batching, paged attention, KV-cache management, and multi-tenant GPU scheduling.
25 min
9.1 intermediate
RAG Pipelines (Retrieval-Augmented Generation)
Design production RAG: chunking, embedding models, hybrid dense-plus-sparse retrieval, reranking, and the eval loops that keep it honest.
25 min
9.2 advanced
Vector Search at Scale (HNSW, IVF-PQ, DiskANN)
Design billion-scale vector search: HNSW, IVF-PQ, and DiskANN indexes, product quantization, hybrid BM25-vector search, and sharding strategies.
25 min
9.3 advanced
AI Agent Architectures (ReAct, Reflection, Planning, Tool Use, Memory)
The canonical patterns for turning an LLM into an agent: ReAct's think-act-observe loop, reflection and self-critique, planner-executor decomposition, tool use and function calling, and how agents manage short- and long-term memory.
25 min
9.4 advanced
Multi-Agent Orchestration (LangGraph, OpenAI Agents SDK, AutoGen, Swarm)
Composing multiple agents into a reliable system: orchestrator-worker topologies, handoffs and delegation, shared memory, parallel fan-out, and the failure modes of agent graphs.
25 min
9.5 advanced
LLM Evaluation and Observability (Ragas, LangSmith, TruLens, LLM-as-Judge)
How to evaluate LLM systems before and after they ship: golden datasets, reference-free metrics, LLM-as-judge, continuous eval pipelines, and the observability stack for production LLMs.
25 min
9.6 intermediate
LLMOps and Prompt Engineering (Versioning, Guardrails, Red-Teaming)
The operational side of shipping LLM features: prompt-as-code, versioning, rollback, A/B testing prompts, structured outputs, and red-teaming before launch.
30 min
9.7 intermediate
LLM Cost Optimisation (Semantic Cache, Model Routing, Cascading, Prompt Caching)
The cost-engineering toolbox for production LLMs: semantic caching, model routing, small-then-big model cascades, provider-side prompt caching (Anthropic, OpenAI), and the unit economics that decide per-request margin.
30 min
9.8 advanced
LLM Safety and Guardrails (OWASP LLM Top 10, Prompt Injection, PII, Jailbreaks)
The safety-engineering surface for LLM applications: OWASP LLM Top 10, prompt-injection defence, PII redaction, jailbreak containment, and the defence-in-depth model for public-facing agents.
25 min
9.9 intermediate
ML System Design Fundamentals
The classic ML systems backbone every modern AI product sits on: candidate generation, ranking, two-tower embeddings, offline/online feature parity, and the training-serving skew problem.
25 min
9.10 advanced
Feature Stores and Model Serving (Feast, Tecton, KServe, BentoML, MLflow)
The infrastructure that makes ML shippable: online and offline feature stores, the model registry, model servers, shadow deploys, and the production lifecycle around a trained model.
30 min
9.11 advanced
Recommendation Systems Deep Dive (DLRM, Two-Tower, Embedding Retrieval, Cold Start)
How modern recommenders work end-to-end: candidate generation via ANN search over embeddings, DLRM-style ranking, exploration-exploitation, cold-start handling, and the evaluation loop that keeps metrics honest.
25 min
9.12 advanced
Realtime AI and Voice Agents (Streaming Inference, WebRTC, LiveKit, Deepgram)
Designing sub-second voice agents: streaming ASR, low-latency LLM inference, streaming TTS, WebRTC transport, interruption handling, and the end-to-end latency budget.
25 min
9.13 intermediate
Multimodal AI Systems (CLIP, Whisper, LayoutLM, Document AI)
Designing systems that ingest images, audio, video, and documents: CLIP-style embeddings for cross-modal retrieval, Whisper pipelines, OCR-plus-layout models, and the storage architecture for unstructured data.
25 min
9.14 intermediate
Data Infrastructure for AI (Embedding Pipelines, Chunking, Unstructured ETL, MCP)
The data plane that feeds AI systems: source connectors, chunking strategies, embedding at scale, metadata schema, freshness, and the Model Context Protocol as a standard interface.
30 min
