Database Normalization
Learn database design principles, normal forms (1NF, 2NF, 3NF), and how to structure relational data efficiently.
Table of Contents
Why Normalize Databases?
Database normalization is the process of organizing data to minimize redundancy and dependency. It divides large tables into smaller ones and defines relationships between them.
Benefits of Normalization
- Eliminates redundancy: Store each piece of data once
- Ensures consistency: Updates happen in one place
- Reduces storage: Less duplicate data
- Easier maintenance: Clear structure, less confusion
- Better data integrity: Fewer anomalies
Problems with Unnormalized Data
| OrderID | CustomerName | CustomerEmail | Products | Prices |
|---|---|---|---|---|
| 1 | John Doe | [email protected] | Laptop, Mouse | $999, $25 |
| 2 | John Doe | [email protected] | Keyboard | $75 |
Problems:
- Customer data duplicated (update anomaly)
- Multiple products in one cell (violates atomicity)
- Hard to query individual products
Update Anomaly: Change customer email in multiple rows
Insert Anomaly: Can't add customer without an order
Delete Anomaly: Deleting order loses customer data
Keys Explained
Primary Key
Uniquely identifies each row in a table. Cannot be NULL.
CREATE TABLE Users (
user_id INT PRIMARY KEY,
username VARCHAR(50) UNIQUE NOT NULL
);
Foreign Key
Links tables together. References primary key in another table.
CREATE TABLE Orders (
order_id INT PRIMARY KEY,
user_id INT,
FOREIGN KEY (user_id) REFERENCES Users(user_id)
);
Composite Key
Primary key made of multiple columns:
CREATE TABLE Enrollments (
student_id INT,
course_id INT,
PRIMARY KEY (student_id, course_id)
);
First Normal Form (1NF)
Rule: Each column must contain atomic (indivisible) values, and each row must be unique.
Before 1NF (Violates Atomicity)
| StudentID | Name | Courses |
|---|---|---|
| 1 | Alice | Math, Physics, Chemistry |
After 1NF (Atomic Values)
| StudentID | Name | Course |
|---|---|---|
| 1 | Alice | Math |
| 1 | Alice | Physics |
| 1 | Alice | Chemistry |
Requirements for 1NF:
- Each column contains single values (no lists, no arrays)
- All entries in a column are the same type
- Each row is unique (has a primary key)
- Order of rows doesn't matter
Second Normal Form (2NF)
Rule: Must be in 1NF AND all non-key columns must depend on the ENTIRE primary key (no partial dependencies).
Only applies to tables with composite primary keys.
Before 2NF (Partial Dependency)
| StudentID | CourseID | StudentName | CourseName |
|---|---|---|---|
| 1 | 101 | Alice | Math |
| 1 | 102 | Alice | Physics |
Problem: StudentName depends only on StudentID, not on (StudentID, CourseID)
After 2NF (Split Into Three Tables)
Students Table:
| StudentID | StudentName |
|---|---|
| 1 | Alice |
Courses Table:
| CourseID | CourseName |
|---|---|
| 101 | Math |
| 102 | Physics |
Enrollments Table:
| StudentID | CourseID |
|---|---|
| 1 | 101 |
| 1 | 102 |
Third Normal Form (3NF)
Rule: Must be in 2NF AND no transitive dependencies (non-key columns depend only on primary key, not on other non-key columns).
Before 3NF (Transitive Dependency)
| EmployeeID | Name | DepartmentID | DepartmentName |
|---|---|---|---|
| 1 | John | 5 | Sales |
| 2 | Jane | 5 | Sales |
Problem: DepartmentName depends on DepartmentID, not directly on EmployeeID
After 3NF (Remove Transitive Dependency)
Employees Table:
| EmployeeID | Name | DepartmentID |
|---|---|---|
| 1 | John | 5 |
| 2 | Jane | 5 |
Departments Table:
| DepartmentID | DepartmentName |
|---|---|
| 5 | Sales |
1NF: Atomic values, unique rows
2NF: 1NF + no partial dependencies
3NF: 2NF + no transitive dependencies
Most applications stop at 3NF (good balance)
When to Denormalize
Sometimes, intentionally violating normalization improves performance:
Reasons to Denormalize
- Read-heavy applications: Reduce JOINs for faster queries
- Reporting databases: Pre-aggregate data
- Caching layers: Store computed values
- NoSQL databases: Optimized for document storage
Add order_total column to Orders table instead of calculating from OrderItems each time:
-- Normalized (requires JOIN and SUM)
SELECT o.order_id, SUM(oi.price * oi.quantity) as total
FROM Orders o
JOIN OrderItems oi ON o.order_id = oi.order_id
GROUP BY o.order_id;
-- Denormalized (direct access)
SELECT order_id, order_total FROM Orders;
Trade-offs
| Normalized | Denormalized | |
|---|---|---|
| Storage | Less space | More space |
| Consistency | Easier | Harder |
| Read Speed | Slower (JOINs) | Faster |
| Write Speed | Faster | Slower (multiple updates) |
Practical Guidelines
For OLTP (Transactional) Systems
- Normalize to 3NF for data integrity
- Use foreign keys to enforce relationships
- Index frequently queried columns
- Denormalize only for proven performance issues
For OLAP (Analytics) Systems
- Star schema (fact tables + dimension tables)
- Denormalize for query performance
- Use materialized views for aggregations
- ETL processes handle data consistency
Common Patterns
One-to-Many:
Users (user_id, name)
Orders (order_id, user_id, order_date)
-- One user has many orders
Many-to-Many (requires junction table):
Students (student_id, name)
Courses (course_id, name)
Enrollments (student_id, course_id, enrolled_date)
-- Many students in many courses
One-to-One (rare):
Users (user_id, username)
UserProfiles (user_id, bio, avatar)
-- Each user has exactly one profile
BCNF (Boyce-Codd NF): Stricter version of 3NF
4NF: Eliminates multi-valued dependencies
5NF: Eliminates join dependencies
Most projects don't need beyond 3NF
Database normalization is about finding the right balance between data integrity and performance. Start normalized, denormalize when measurements prove it's necessary.