Database Basics

The database is the most consequential architectural decision in most systems. The choice of storage engine, data model, and replication strategy shapes everything downstream — from query patterns to scaling approach to operational complexity.

SQL vs NoSQL

The first fork in every data layer decision: relational or non-relational. The answer depends on your data model, consistency requirements, and scale targets.

SQL (Relational)NoSQL
Data modelTables with rows and columns, strict schemaDocuments, key-value, wide-column, or graph — flexible schema
ConsistencyACID transactionsUsually eventual consistency (BASE)
JoinsFirst-class supportTypically avoided — denormalize instead
ScalingVertical primarily; horizontal via read replicas or shardingHorizontal by design
Best forComplex relationships, strong consistency, reportingHigh write throughput, flexible schema, massive scale
ExamplesPostgreSQL, MySQL, SQLiteMongoDB, Cassandra, DynamoDB, Redis

NoSQL data models

  1. Document store (MongoDB, Firestore) — stores JSON-like documents. Good for nested, variable-shape data. Queries within a document are fast; cross-document joins are expensive.
  2. Key-value store (Redis, DynamoDB) — the simplest model: a key maps to a blob. Extremely fast for lookups by key; poor for range scans or complex queries.
  3. Wide-column store (Cassandra, HBase) — rows identified by a partition key, columns grouped into families. Optimised for writing and reading entire rows at high throughput across a distributed cluster.
  4. Graph database (Neo4j) — nodes and edges with properties. Native support for relationship traversal. Ideal for social graphs, recommendation engines, and fraud detection — anywhere the connections between data matter as much as the data itself.

ACID

ACID guarantees define what it means for a relational database transaction to be safe.

  1. Atomicity — a transaction either fully completes or fully rolls back. No partial writes. If a bank transfer debits one account but crashes before crediting the other, the debit is automatically undone.
  2. Consistency — a transaction takes the database from one valid state to another, never violating defined constraints (foreign keys, unique constraints, check constraints). An inconsistent write is rejected.
  3. Isolation — concurrent transactions execute as if they were sequential. One transaction cannot see another's uncommitted changes. The exact behaviour is configurable via isolation levels (read committed, repeatable read, serializable).
  4. Durability — once committed, a transaction survives crashes. Data is written to persistent storage via a Write-Ahead Log (WAL) before the commit is acknowledged.

BASE — the NoSQL alternative

Most NoSQL databases trade ACID for BASE semantics to achieve horizontal scale:

  1. Basically Available — the system responds to every request, even if some nodes are down.
  2. Soft state — the state of the system may change over time, even without new input, as replicas converge.
  3. Eventually consistent — given no new updates, all replicas will eventually reach the same state.

BASE is not a failure mode — it is a deliberate trade-off: accept stale reads in exchange for lower write latency and higher availability under partition.

Read Replicas

Most production workloads are read-heavy (often 80–90% reads). A read replica is an asynchronous copy of the primary database that serves read queries, offloading the primary so it can focus on writes.

How replication works

  1. The primary writes every committed transaction to a replication log (binlog in MySQL, WAL in PostgreSQL).
  2. Replicas tail that log and apply the same changes, typically with a lag of milliseconds to low seconds.
  3. Reads are directed to replicas; writes go to the primary. The application or a proxy (ProxySQL, PgBouncer) handles routing.

Replication lag and stale reads

Because replication is asynchronous, a read immediately after a write may return stale data. Strategies to handle this:

  1. Read-your-writes consistency — after a user performs a write, route their subsequent reads to the primary for a short window. Other users may still read from the replica.
  2. Monotonic reads — ensure a user always reads from the same replica so they never see data "go backwards."
  3. Strong consistency — route all reads to the primary. Eliminates stale reads but removes the load-balancing benefit of replicas entirely.
Synchronous replicationAsynchronous replication
DurabilityWrite confirmed only after replica ACKsWrite confirmed before replica ACKs
Data loss on primary failureNone — replica is fully up to datePossible — replica may lag
Write latencyHigher — waits for replica round-tripLower — returns immediately
Common useFinancial systems, one synchronous standbyMost read replicas

Indexes

Without an index, a database must scan every row in a table to find matches — an O(n) full table scan. An index pre-sorts a subset of the data so the database can jump directly to relevant rows. The trade-off: indexes speed up reads but slow down writes (every insert/update must also update the index) and consume additional storage.

B-Tree index

The default index type in PostgreSQL, MySQL, and most SQL databases. A balanced tree structure where each leaf node holds the indexed value and a pointer to the actual row. Supports:

  1. Exact lookups: WHERE id = 42
  2. Range queries: WHERE created_at BETWEEN '2024-01-01' AND '2024-12-31'
  3. Prefix matching: WHERE name LIKE 'Jo%'
  4. Ordering: ORDER BY indexed_column can use the index without a sort step

Hash index

Stores a hash of the indexed value. Extremely fast for exact-match lookups (O(1) average), but cannot support range queries or ordering. Rarely used as a standalone index type; more relevant inside the query engine for hash joins.

Composite indexes

An index on multiple columns: INDEX ON (user_id, created_at). The column order matters — the index is useful for queries that filter on user_id alone or on user_id + created_at together, but not on created_at alone (the leftmost-prefix rule). Design composite indexes around your most frequent query patterns.

When not to index

  1. Low-cardinality columns (boolean flags, enums with few values) — the index may not be selective enough to be worth using; the planner may ignore it.
  2. Write-heavy tables where the index maintenance overhead outweighs the read benefit.
  3. Small tables where a full scan is fast enough — the query planner will often choose a sequential scan over an index anyway.