Back-Of-The-Envelope Estimation

Why it matters

Back-of-the-envelope estimation is the skill of quickly sizing a system using rough numbers and order-of-magnitude arithmetic. The goal is not precision — it is feasibility. You want to know whether a single database can handle the write load, whether the dataset fits in RAM, whether a CDN is necessary, and whether your architecture makes sense at the stated scale — all before writing a line of code or drawing a detailed diagram.

In a system design interview, estimation signals that you think about scale, not just the happy path. In production engineering, it saves you from over-engineering solutions for problems you don't have and under-engineering for ones you do.

The process is always the same: state your assumptions explicitly, use round numbers, sanity-check the result against something you know, and adjust the architecture accordingly.

Key numbers to memorise

Powers of 2

Storage and memory units are powers of 2, and every step of 2¹⁰ is close to a factor of 1,000. Knowing these cold means you can convert between units in your head.

| Power | Exact value | Approximate | Name |
|---|---|---|---|
| 2¹⁰ | 1,024 | ~1 thousand | 1 KB |
| 2²⁰ | 1,048,576 | ~1 million | 1 MB |
| 2³⁰ | 1,073,741,824 | ~1 billion | 1 GB |
| 2⁴⁰ | ~1.1 × 10¹² | ~1 trillion | 1 TB |
| 2⁵⁰ | ~1.1 × 10¹⁵ | ~1 quadrillion | 1 PB |
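As a small illustration of these conversions, the sketch below formats a raw byte count using binary units. The helper name `human_size` is hypothetical and only serves the example.

```python
# Binary unit names; each one is 2**10 times larger than the previous.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB"]


def human_size(num_bytes):
    """Format a byte count using binary (power-of-2) units."""
    value = float(num_bytes)
    unit_index = 0
    # Divide by 1,024 until the value fits the current unit
    # or we run out of unit names.
    while value >= 1024 and unit_index < len(UNITS) - 1:
        value /= 1024
        unit_index += 1
    return f"{value:.1f} {UNITS[unit_index]}"


print(human_size(2**30))
print(human_size(5 * 2**40))
```

Note that 2⁴⁰ bytes is about 10% larger than 10¹² bytes, so mixing binary and decimal units introduces a small error that is acceptable at estimation precision.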

Latency numbers every engineer should know

These are rough orders of magnitude (Jeff Dean's numbers, lightly updated). The key intuition: a RAM reference is ~1,000× faster than an SSD random read, SSD is roughly 20–70× faster than HDD depending on the access pattern, and a US-to-Europe round trip is ~300 million× slower than an L1 cache read.

| Operation | Latency |
|---|---|
| L1 cache reference | 0.5 ns |
| L2 cache reference | 7 ns |
| Main memory (RAM) reference | 100 ns |
| Read 1 MB sequentially from RAM | 250 µs |
| SSD random read | 150 µs |
| Read 1 MB sequentially from SSD | 1 ms |
| HDD seek | 10 ms |
| Read 1 MB sequentially from HDD | 20 ms |
| Round trip within same datacenter | 0.5 ms |
| Send 1 MB over 1 Gbps network | 10 ms |
| Round trip US West → US East | 40 ms |
| Round trip US → Europe | 150 ms |
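The ratios between these operations can be recomputed directly from the table. The following is a minimal sketch; the dictionary keys and the `slowdown` helper are hypothetical names chosen for the example.

```python
# Latencies taken from the table, converted to nanoseconds.
LATENCY_NS = {
    "l1_cache": 0.5,
    "ram": 100,
    "ssd_random_read": 150_000,
    "hdd_seek": 10_000_000,
    "same_dc_round_trip": 500_000,
    "us_to_europe_round_trip": 150_000_000,
}


def slowdown(slow, fast):
    """Return how many times slower `slow` is than `fast`."""
    return LATENCY_NS[slow] / LATENCY_NS[fast]


for slow, fast in [
    ("ssd_random_read", "ram"),
    ("hdd_seek", "ssd_random_read"),
    ("same_dc_round_trip", "l1_cache"),
    ("us_to_europe_round_trip", "l1_cache"),
]:
    print(f"{slow} vs {fast}: {slowdown(slow, fast):,.0f}x")
```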

Data size quick reference

| Type | Size |
|---|---|
| char | 1 byte |
| int | 4 bytes |
| long / double | 8 bytes |
| UUID | 16 bytes |
| Timestamp (Unix) | 4–8 bytes |
| Short text (tweet, URL) | ~200–500 bytes with metadata |
| Thumbnail image | ~20 KB |
| Compressed JPEG (profile photo) | ~50–100 KB |
| High-quality photo | ~300 KB |
| 1-minute audio (compressed) | ~1 MB |
| 1-minute HD video (compressed) | ~50 MB |

Time conversions

The most useful conversion: there are 86,400 seconds in a day. For estimation, round up to 10⁵. This is the single most-used number in traffic calculations.

| Period | Seconds | Rounded |
|---|---|---|
| 1 minute | 60 | ~60 |
| 1 hour | 3,600 | ~4 × 10³ |
| 1 day | 86,400 | ~10⁵ |
| 1 month (30 days) | 2,592,000 | ~2.5 × 10⁶ |
| 1 year | 31,536,000 | ~3 × 10⁷ |
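To see how much the 10⁵ shortcut costs in accuracy, the sketch below compares QPS computed with the exact and the rounded day length. The request volume is an illustrative assumption, not a figure from the text.

```python
SECONDS_PER_DAY_EXACT = 86_400
SECONDS_PER_DAY_ROUNDED = 10**5


def qps(requests_per_day, seconds_per_day):
    """Average requests per second over one day."""
    return requests_per_day / seconds_per_day


requests_per_day = 1_000_000_000  # illustrative: 1B requests/day

exact = qps(requests_per_day, SECONDS_PER_DAY_EXACT)
rounded = qps(requests_per_day, SECONDS_PER_DAY_ROUNDED)
relative_error = (rounded - exact) / exact

print(f"exact:   {exact:,.0f} QPS")
print(f"rounded: {rounded:,.0f} QPS")
print(f"error:   {relative_error:.1%}")
```

Rounding to 10⁵ underestimates QPS by roughly 14%, which is small next to the 2–3× peak multipliers applied afterwards.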

Traffic estimation

Traffic estimation answers: how many requests per second must the system handle? Start from daily active users (DAU) and work down to queries per second (QPS).

The formula

QPS = (DAU × requests_per_user_per_day) / seconds_per_day

Peak QPS ≈ 2 × average QPS   (conservative)
Peak QPS ≈ 3 × average QPS   (for bursty workloads)
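The formula above can be sketched as two small functions. The function and parameter names are hypothetical; the peak multiplier defaults to the conservative 2× from the text.

```python
SECONDS_PER_DAY = 86_400


def average_qps(dau, requests_per_user_per_day):
    """Average queries per second from daily active users."""
    return dau * requests_per_user_per_day / SECONDS_PER_DAY


def peak_qps(avg_qps, burst_factor=2.0):
    """Peak QPS as a multiple of the average (2x conservative, 3x bursty)."""
    return avg_qps * burst_factor


avg = average_qps(dau=10_000_000, requests_per_user_per_day=10)
print(f"average: {avg:,.0f} QPS")
print(f"peak (2x): {peak_qps(avg):,.0f} QPS")
print(f"peak (3x): {peak_qps(avg, burst_factor=3.0):,.0f} QPS")
```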

Read/write ratio

Most real systems are read-heavy. Always split QPS into reads and writes — they drive different infrastructure decisions. Writes hit the primary database; reads can be served from replicas or caches.

# Example: a social feed with 100:1 read/write ratio
Total QPS    = 10,000
Write QPS    = 10,000 / (100 + 1) ≈ 100 writes/sec
Read QPS     = 10,000 − 100       ≈ 9,900 reads/sec
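A minimal sketch of the same split, assuming the ratio is expressed as reads per write. The name `split_qps` is hypothetical.

```python
def split_qps(total_qps, reads_per_write):
    """Split total QPS into (read_qps, write_qps) for an N:1 read/write ratio."""
    # One write for every `reads_per_write` reads, so writes are
    # 1 / (N + 1) of the total and reads are the remainder.
    write_qps = total_qps / (reads_per_write + 1)
    read_qps = total_qps - write_qps
    return read_qps, write_qps


reads, writes = split_qps(total_qps=10_000, reads_per_write=100)
print(f"reads:  {reads:,.0f}/sec")
print(f"writes: {writes:,.0f}/sec")
```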

Scale benchmarks

| Scale | DAU | Avg QPS (10 req/user/day) | Architecture hint |
|---|---|---|---|
| Small | 100K | ~10 QPS | Single server + single DB is fine |
| Medium | 10M | ~1,000 QPS | Load balancer, read replicas, basic cache |
| Large | 100M | ~10,000 QPS | Sharding, heavy caching, CDN, multiple regions |
| Massive | 1B | ~100,000 QPS | Full distribution, custom infrastructure |

Storage estimation

Storage estimation answers: how much disk space does the system need, and for how long? Always estimate for a multi-year horizon; five years is the standard in interviews.

The formula

Daily storage  = writes_per_day × record_size
Total storage  = daily_storage × retention_days

# Add ~20% overhead for indexes and metadata (apply the replication factor separately)
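The storage formula can be sketched as a single function. The name `total_storage_bytes` and its parameters are hypothetical; the 20% overhead default comes from the text.

```python
def total_storage_bytes(writes_per_day, record_size_bytes,
                        retention_days, overhead=0.20):
    """Total storage over the retention window, including overhead."""
    daily_bytes = writes_per_day * record_size_bytes
    return daily_bytes * retention_days * (1 + overhead)


# Illustrative inputs: 6M records/day, 300 bytes each, kept for 5 years.
total = total_storage_bytes(
    writes_per_day=6_000_000,
    record_size_bytes=300,
    retention_days=5 * 365,
)
print(f"{total / 1e12:.2f} TB")
```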

Distinguishing storage types

  1. Metadata (SQL/NoSQL) — text fields, IDs, timestamps. Usually small (hundreds of bytes per record). Scales in GB to TB range even for large services.
  2. Media / blobs (object storage) — images, video, audio. Sizes jump by orders of magnitude (KB to MB per record). A service with photo uploads almost certainly needs object storage (S3, GCS) and a CDN — the database is not the right place for binary blobs.

A common mistake is lumping both together. Separate them — they require completely different infrastructure and the media number often dwarfs the metadata number by 100× or more.

Bandwidth & memory

Bandwidth

Bandwidth tells you the data throughput the system must sustain — important for CDN sizing, network provisioning, and understanding egress costs.

Inbound bandwidth  = write_QPS × average_request_size
Outbound bandwidth = read_QPS  × average_response_size
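Network links are usually quoted in bits per second, so it can help to convert the result. The sketch below applies the formulas above and adds that conversion; the helper names are hypothetical.

```python
def bandwidth_bytes_per_sec(qps, avg_payload_bytes):
    """Sustained throughput in bytes per second."""
    return qps * avg_payload_bytes


def to_megabits_per_sec(bytes_per_sec):
    """Convert bytes/sec to megabits/sec (decimal mega, 8 bits per byte)."""
    return bytes_per_sec * 8 / 1_000_000


# Illustrative inputs: 7,000 reads/sec returning 300 bytes each.
outbound = bandwidth_bytes_per_sec(qps=7_000, avg_payload_bytes=300)
print(f"{outbound / 1e6:.1f} MB/s = {to_megabits_per_sec(outbound):.1f} Mbps")
```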

Cache memory sizing

The 80/20 rule (Pareto principle) is a useful rule of thumb for caching: roughly 20% of the data receives 80% of the traffic, so caching that 20% eliminates most database reads.

Cache size = read_QPS × seconds_per_day × avg_response_size × 0.20

# Sanity check: does this fit on a single Redis instance?
# A typical Redis server holds 10–100 GB comfortably.
# If the number is larger, plan for a Redis cluster.
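A minimal sketch of the cache sizing rule and the single-node sanity check. The function names are hypothetical, and the 100 GB node capacity is the upper end of the range quoted above, used here as an assumption. Because the formula counts bytes read rather than unique bytes stored, it behaves as an upper bound.

```python
SECONDS_PER_DAY = 86_400


def cache_size_bytes(read_qps, avg_response_bytes, hot_fraction=0.20):
    """Cache size as a fraction of one day's read volume."""
    daily_read_bytes = read_qps * SECONDS_PER_DAY * avg_response_bytes
    return daily_read_bytes * hot_fraction


def fits_on_single_node(size_bytes, node_capacity_bytes=100 * 10**9):
    """Check the estimate against an assumed per-node capacity."""
    return size_bytes <= node_capacity_bytes


# Illustrative inputs: 7,000 reads/sec returning 300 bytes each.
size = cache_size_bytes(read_qps=7_000, avg_response_bytes=300)
print(f"{size / 1e9:.1f} GB, single node: {fits_on_single_node(size)}")
```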

Replication factor

Always multiply your raw storage number by the replication factor before comparing it against infrastructure costs. A dataset stored with 3× replication costs 3× the disk. Standard factors: 3× for most databases, 2–3× for object storage (with erasure coding it can be lower).
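The two approaches can be compared with a short sketch. The function names are hypothetical, and the 10 data + 4 parity shard layout is an illustrative assumption rather than a figure from the text.

```python
def replicated_bytes(raw_bytes, replication_factor=3):
    """Disk needed when every byte is stored `replication_factor` times."""
    return raw_bytes * replication_factor


def erasure_coded_bytes(raw_bytes, data_shards, parity_shards):
    """Disk needed with erasure coding: overhead is (data + parity) / data."""
    return raw_bytes * (data_shards + parity_shards) / data_shards


raw = 10e12  # illustrative: 10 TB of raw data

print(f"3x replication:     {replicated_bytes(raw) / 1e12:.0f} TB")
# Assumed layout: 10 data shards + 4 parity shards.
print(f"10+4 erasure coded: {erasure_coded_bytes(raw, 10, 4) / 1e12:.0f} TB")
```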

Worked example: URL Shortener

A URL shortener converts a long URL into a short code (short.ly/abc123) and redirects visitors to the original. This is a read-heavy, write-light system.

Assumptions

  1. 180M new short URLs created per month → ~6M per day
  2. 100:1 read/write ratio
  3. Average URL length: 200 bytes; ~300 bytes per record once metadata is included
  4. 5-year retention

Traffic

Write QPS  = 6,000,000 / 86,400        ≈ 70 writes/sec
Read QPS   = 70 × 100                 = 7,000 reads/sec
Peak reads ≈ 7,000 × 2               = ~14,000 reads/sec

Storage

Records over 5 years = 6M/day × 365 × 5   = ~11B records
Storage (metadata)   = 11B × 300 bytes    = ~3.3 TB
(+ 20% overhead)                          ≈ 4 TB

Bandwidth

Inbound  = 70 writes/sec × 300 bytes  ≈ 21 KB/s    (negligible)
Outbound = 7,000 reads/sec × 300 bytes ≈ 2.1 MB/s   (manageable)

Cache

Daily read volume = 7,000 reads/sec × 86,400 sec = ~600M reads/day
Cache top 20%     = 600M × 0.20 × 300 bytes          = ~36 GB

→ Fits comfortably on a single Redis instance (or small cluster).
  Cache hit ratio should be very high — popular short codes are
  accessed millions of times while the long tail is rarely used.

Architecture signals

  1. 14K peak reads/sec — a single database can handle this, but add a read replica and Redis cache to reduce latency below 5 ms.
  2. 70 writes/sec — trivially handled by any relational database.
  3. 4 TB metadata over 5 years — a single PostgreSQL instance can hold this; sharding is not needed yet.
  4. High read/write ratio → cache is the single most impactful optimisation.
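The whole URL shortener estimate fits in a few lines of arithmetic. The sketch below recomputes it from the stated assumptions without intermediate rounding, so the results differ slightly from the rounded figures in the text. Variable names are hypothetical.

```python
SECONDS_PER_DAY = 86_400

# Assumptions from the worked example.
urls_per_day = 6_000_000
read_write_ratio = 100
record_bytes = 300
retention_days = 5 * 365

# Traffic
write_qps = urls_per_day / SECONDS_PER_DAY
read_qps = write_qps * read_write_ratio
peak_read_qps = read_qps * 2

# Storage (with 20% overhead)
records = urls_per_day * retention_days
storage_bytes = records * record_bytes * 1.2

# Bandwidth and cache (top 20% of one day's read volume)
outbound_bytes_per_sec = read_qps * record_bytes
cache_bytes = read_qps * SECONDS_PER_DAY * record_bytes * 0.20

print(f"write QPS:  {write_qps:,.0f}")
print(f"read QPS:   {read_qps:,.0f} (peak {peak_read_qps:,.0f})")
print(f"records:    {records / 1e9:.1f}B")
print(f"storage:    {storage_bytes / 1e12:.1f} TB")
print(f"outbound:   {outbound_bytes_per_sec / 1e6:.1f} MB/s")
print(f"cache:      {cache_bytes / 1e9:.0f} GB")
```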

Worked example: Social Feed

A Twitter-like microblogging platform where users post short text messages and browse a personalised feed of posts from people they follow. This is a read-heavy, write-moderate, fan-out-heavy system.

Assumptions

  1. 300M DAU
  2. 20% of users post daily → 60M posts/day
  3. Average post: 200 bytes of text + 100 bytes metadata = 300 bytes
  4. 10% of posts include an image (average 300 KB compressed)
  5. Each user reads their feed 5× per day, loading 20 posts per load
  6. Average user has 200 followers
  7. 5-year retention for posts; media stored indefinitely in object storage

Traffic

# Writes (posts)
Write QPS      = 60,000,000 / 86,400         ≈ 700 posts/sec
Peak write QPS ≈ 700 × 3                     ≈ 2,100 posts/sec

# Reads (timeline)
Timeline requests/day = 300M users × 5 reads = 1.5B requests/day
Read QPS              = 1,500,000,000 / 86,400 ≈ 17,000 reads/sec
Peak read QPS         ≈ 17,000 × 2            ≈ 34,000 reads/sec

# Fan-out writes (push model: pre-write to each follower's feed cache)
Fan-out writes/sec = 700 posts/sec × 200 followers = 140,000 writes/sec
→ This is why Twitter-scale systems use hybrid push/pull fan-out

Storage

# Text metadata
Posts over 5 years    = 60M/day × 365 × 5     = ~110B posts
Text storage          = 110B × 300 bytes       = ~33 TB
(+ 20% overhead)                               ≈ 40 TB

# Media (images)
Image uploads/day     = 60M posts × 10%       = 6M images/day
Media storage/day     = 6M × 300 KB           = ~1.8 TB/day
Media over 5 years    = 1.8 TB × 365 × 5      ≈ 3.3 PB

→ Text metadata → sharded relational or wide-column DB (Cassandra)
→ Media         → object storage (S3/GCS) + CDN — the DB never sees images
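Keeping the two storage classes in separate variables makes the gap between them explicit. The sketch below uses the assumptions from this example; variable names are hypothetical and 1 KB is treated as 1,000 bytes for simplicity.

```python
# Assumptions from the worked example.
posts_per_day = 60_000_000
post_bytes = 300
image_fraction_percent = 10
image_bytes = 300_000          # 300 KB, using decimal kilobytes
retention_days = 5 * 365

# Text metadata
posts_total = posts_per_day * retention_days
text_bytes = posts_total * post_bytes

# Media
images_per_day = posts_per_day * image_fraction_percent // 100
media_bytes_per_day = images_per_day * image_bytes
media_bytes = media_bytes_per_day * retention_days

print(f"text:  {text_bytes / 1e12:.1f} TB (before overhead)")
print(f"media: {media_bytes / 1e15:.2f} PB")
print(f"media is {media_bytes // text_bytes}x the text volume")
```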

Bandwidth

# Inbound (uploads)
Text writes  = 700/sec × 300 bytes            ≈ 210 KB/s
Image writes = 6M images / 86,400 sec × 300 KB ≈ 20 MB/s  ← dominant

# Outbound (feed reads)
Per request  = 20 posts × 300 bytes           = 6 KB (text only)
Total text   = 17,000 reads/sec × 6 KB        ≈ 100 MB/s
Image views  = much larger → offload entirely to CDN

Cache

# Hot posts (top 20% get 80% of reads)
Daily read volume  = 17,000 reads/sec × 86,400 × 6 KB = ~8.8 TB reads/day
Cache top 20%      = 8.8 TB × 0.20                    = ~1.76 TB

→ Too large for a single Redis instance.
→ Redis Cluster across multiple nodes, or limit cache to
  the top-N trending posts + each user's pre-computed feed.

Architecture signals

  1. 34K peak read QPS — well beyond a single database; requires sharding + heavy caching.
  2. 140K fan-out writes/sec: a pure push model is infeasible for celebrity accounts with millions of followers. Use a hybrid model: push posts from accounts with fewer than 10K followers; pull and merge posts from celebrity accounts at read time.
  3. 3.3 PB of media — object storage is mandatory; no relational database stores blobs at this scale economically.
  4. 1.8 TB/day image ingest — CDN with origin offload + async processing pipeline for resizing thumbnails (message queue → worker fleet → object storage).
  5. Cache is essential but too large for a single node — pre-compute timelines and store in distributed cache keyed by user ID.
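The hybrid fan-out decision can be sketched as a simple threshold check. The 10K cutoff comes from the list above; the function names and the sample follower counts are hypothetical, and a real system would tune the threshold empirically.

```python
PUSH_THRESHOLD = 10_000  # follower cutoff from the hybrid model above


def fanout_strategy(follower_count):
    """Pick 'push' for ordinary accounts and 'pull' for celebrity accounts."""
    return "push" if follower_count < PUSH_THRESHOLD else "pull"


def feed_cache_writes(follower_counts, hybrid=True):
    """Feed-cache writes generated when each listed author posts once."""
    if not hybrid:
        return sum(follower_counts)
    # Under the hybrid model, celebrity posts are merged at read time
    # and generate no fan-out writes.
    return sum(n for n in follower_counts if fanout_strategy(n) == "push")


# Illustrative authors: two ordinary accounts and one celebrity.
authors = [200, 5_000, 2_000_000]

print(f"pure push: {feed_cache_writes(authors, hybrid=False):,} writes")
print(f"hybrid:    {feed_cache_writes(authors):,} writes")
```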