Part 4: Data Systems — The HLD Handbook

4.0 (advanced)
Storage Engines: B-Trees, LSM-Trees, and Why Your Database Feels the Way It Does
How B-tree and LSM-tree storage engines shape read, write, and space amplification, with examples from InnoDB, PostgreSQL, RocksDB, and Cassandra.
25 min | MySQL, PostgreSQL, Cassandra, +3
4.1 (intermediate)
OLTP vs OLAP: Row Stores, Column Stores, and Matching Shape to Workload
Why transactional systems use row-oriented storage and analytical systems use columnar, with examples from Postgres, MySQL, Redshift, BigQuery, ClickHouse, and Snowflake.
25 min | PostgreSQL, MySQL, BigQuery, +6
4.2 (intermediate)
Data Warehouses and Data Lakes: Structure, Schema, and the Lakehouse
How Redshift, BigQuery, Snowflake, S3-based lakes, and the lakehouse pattern with Delta Lake, Iceberg, and Hudi actually fit together.
25 min | BigQuery, S3, Spark, +6
4.3 (intermediate)
Stream vs Batch Processing: Lambda, Kappa, and the End of That Debate
Batch with Spark and Hadoop, streaming with Kafka Streams, Flink, and Spark Streaming, and how Lambda and Kappa architectures stack up.
25 min | Kafka, Flink, Spark, +2
4.4 (intermediate)
Change Data Capture: Streaming the Database's Inner Monologue
How Debezium, Maxwell, and the outbox pattern turn WAL and binlog entries into reliable event streams, and when each approach is the right call.
25 min | PostgreSQL, MySQL, Kafka, +2
4.5 (intermediate)
Search Systems: Inverted Indexes, BM25, and Running Elasticsearch in Production
How Elasticsearch, OpenSearch, and Solr build inverted indexes, score with BM25, and handle faceting, relevance tuning, and sharding at scale.
30 min | Elasticsearch, OpenSearch, PostgreSQL, +1
4.6 (intermediate)
Time-Series Databases: Metrics, Events, and Retention at Scale
How Prometheus, InfluxDB, TimescaleDB, and VictoriaMetrics handle write-heavy time-series workloads with downsampling and retention policies.
25 min | Prometheus, Grafana, InfluxDB, +2
4.7 (intermediate)
Graph Databases: Property Graphs, Cypher, and When Joins Are the Problem
How Neo4j, Amazon Neptune, and Dgraph model relationships, and when graph queries beat recursive SQL joins.
25 min | Neo4j, RocksDB
4.8 (advanced)
Vector Databases: Embeddings, ANN Indexes, and the Retrieval Layer for AI
How Pinecone, Weaviate, Milvus, and pgvector store and search embeddings using HNSW and IVF approximate nearest neighbor indexes.
25 min | Pinecone, Weaviate, Milvus, +6
4.9 (intermediate)
Key-Value Stores: Redis, Memcached, DynamoDB, and Picking the Right Hash Table
How Redis, Memcached, and DynamoDB differ in durability, data model, and scaling, and when each is the right key-value store.
25 min | Redis, Memcached, DynamoDB, +1