Normalization

Normalization is the process of structuring a relational database schema to reduce data redundancy and prevent update anomalies. Each normal form builds on the previous one, progressively eliminating a specific class of design flaw. In practice, most production schemas aim for Third Normal Form and denormalize deliberately when read performance demands it.

What is Normalization

An unnormalized schema stores the same fact in multiple places. This creates three classes of anomaly:

  1. Update anomaly — changing one fact requires updating it in multiple rows. Miss one and the data is inconsistent.
  2. Insert anomaly — you cannot insert one piece of data without inserting another unrelated piece alongside it (e.g. you cannot record a department until at least one employee is assigned to it).
  3. Delete anomaly — deleting a row inadvertently destroys unrelated data (e.g. removing the last employee in a department also deletes the department's name and location).

Normalization solves these anomalies by ensuring each fact is stored in exactly one place.

First Normal Form (1NF)

A table is in 1NF when every column contains only atomic (indivisible) values — no lists, arrays, or comma-separated strings — and every row is uniquely identifiable by a primary key.

Violation example

orders ────────────────────────────────────────── order_id customer items 1 Alice "Laptop, Mouse, Pad" ← multi-value cell 2 Bob "Keyboard"

The items column stores multiple values in a single cell. You cannot query for all orders containing "Mouse" without parsing the string in the application.

Fix: extract the repeating group into its own table

orders order_items ────────────────── ────────────────────── order_id (PK) ←── order_id (FK) customer item_id (PK) product_name

Now each row contains one value per column and can be queried, indexed, and joined cleanly.

Second Normal Form (2NF)

A table is in 2NF when it is in 1NF and every non-key column is fully dependent on the entire primary key. 2NF only applies when the primary key is composite (made of multiple columns) — a table with a single-column PK is automatically in 2NF if it satisfies 1NF.

Violation example

order_items ────────────────────────────────────────────── order_id (PK) product_id (PK) product_name quantity 1001 42 "Laptop" 1 1001 17 "Mouse" 2 1002 42 "Laptop" 1

product_name depends only on product_id, not on the full composite key (order_id, product_id). This is a partial dependency. If the product is renamed, every row must be updated.

Fix: move the partially-dependent column to its own table

order_items products ────────────────────────── ────────────────── order_id (PK) product_id (PK) product_id (PK + FK) ──────→ product_name quantity

Now product_name is stored once in products. Renaming a product requires a single update regardless of how many orders reference it.

Third Normal Form (3NF)

A table is in 3NF when it is in 2NF and every non-key column depends directly on the primary key — not on another non-key column. Dependencies between non-key columns are called transitive dependencies.

Violation example

employees ──────────────────────────────────────────────────── emp_id (PK) name dept_id dept_name dept_floor 1 Alice 10 Engineering 3 2 Bob 10 Engineering 3 3 Carol 20 Marketing 2

dept_name and dept_floor depend on dept_id, which is itself a non-key column. The chain is: emp_id → dept_id → dept_name. Renaming the Engineering department requires updating every employee row in that department.

Fix: extract the transitively-dependent columns

employees departments ────────────────────── ──────────────────────── emp_id (PK) dept_id (PK) name dept_name dept_id (FK) ──────────→ dept_floor

Each table now stores exactly one concern. dept_name lives in one place — a single update propagates everywhere via the foreign key.

A note on BCNF

Boyce-Codd Normal Form (BCNF) is a stricter variant of 3NF that handles edge cases involving multiple overlapping candidate keys. It rarely matters in practice — if your schema is in 3NF and you do not have tables with multiple composite candidate keys that share columns, you are effectively in BCNF already.

When to Denormalize

Normalization optimises for write correctness. Denormalization deliberately reintroduces redundancy to optimise for read performance. It is a conscious trade-off, not a design mistake.

Valid reasons to denormalize

  1. Read-heavy workloads — if a report joins ten tables on every page load, precomputing and storing the result in a summary table can reduce query time from seconds to milliseconds.
  2. Analytics / OLAP — data warehouses are intentionally denormalized into star or snowflake schemas because analytical queries scan millions of rows and joins are too expensive at that scale.
  3. Rarely-changing reference data — if a product name never changes, duplicating it in the orders table avoids a join at query time with minimal risk of inconsistency.
  4. Derived columns — storing a precomputed value (e.g. total_items on an order) avoids an aggregation query. Acceptable if you can keep it consistent via triggers or application logic.
NormalizedDenormalized
RedundancyNone — each fact stored onceIntentional — facts duplicated for read speed
Write costLow — update one placeHigher — must keep copies consistent
Read costHigher — JOINs requiredLower — data pre-joined or pre-aggregated
Best forOLTP, write-heavy, frequently changing dataOLAP, read-heavy, rarely changing data