Rate Limiting

A rate limiter controls how many requests a client can make in a given time window. Without one, a single misbehaving client — or a traffic spike — can exhaust your backend and degrade service for everyone. The algorithm you choose determines how accurately limits are enforced, how much memory is consumed, and whether bursts are allowed.

Why Rate Limiting

Protect backend services — prevents any single client from overwhelming downstream services, databases, or third-party APIs with more requests than they can handle.
Prevent abuse — stops brute-force attacks, credential stuffing, scraping, and denial-of-service attempts at the API layer before they reach application code.
Enforce fair usage — ensures no single tenant consumes a disproportionate share of shared infrastructure, keeping the service responsive for all users.
Control costs — outbound API calls to third-party services often have per-request pricing. Rate limiting your own consumers prevents runaway spend.

Where to implement it

API Gateway — the most common placement. Enforced before any request reaches application code. AWS API Gateway, Kong, and Nginx all support built-in rate limiting.
Application layer — finer-grained control per endpoint or user tier. Implemented via middleware or a library (e.g. express-rate-limit for Express, Bucket4j for Java and Spring applications).
Distributed cache (Redis) — the backing store for most production rate limiters. Atomic commands such as INCR, and Lua scripts that run several commands (for example INCR followed by EXPIRE) as one atomic step, keep counts correct across multiple application instances without race conditions.

Fixed Window Counter

Divide time into fixed windows (e.g. one-minute buckets). Maintain a counter per client per window. Increment on each request; reject when the counter exceeds the limit. Reset the counter at the start of each new window.

Example

Limit: 100 requests per minute, with the window resetting at :00 of each minute. A client that sends 100 requests at :59 and another 100 at :01 of the next minute makes 200 requests in two seconds. Both windows allow it because each sees only 100 requests.
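The boundary burst in this example can be reproduced with a minimal in-memory sketch. The class name FixedWindowLimiter, the client ID, and the explicit now argument (seconds, passed in so the run is deterministic) are assumptions made for illustration; a production limiter would more likely keep the counters in a shared store such as Redis.

```python
class FixedWindowLimiter:
    """Illustrative single-process fixed window counter (hypothetical helper)."""

    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window_seconds = window_seconds
        # (client_id, window_index) -> requests counted in that window.
        # Old windows are never evicted here; a real store would expire them.
        self.counters = {}

    def allow(self, client_id: str, now: float) -> bool:
        # Integer division maps every timestamp in a window to the same index.
        window_index = int(now // self.window_seconds)
        key = (client_id, window_index)
        count = self.counters.get(key, 0)
        if count >= self.limit:
            return False
        self.counters[key] = count + 1
        return True


fw_limiter = FixedWindowLimiter(limit=100, window_seconds=60)

# 100 requests at second 59 (window 0), then 100 more at second 61 (window 1).
fw_first_burst = sum(fw_limiter.allow("client-a", now=59.0) for _ in range(100))
fw_second_burst = sum(fw_limiter.allow("client-a", now=61.0) for _ in range(100))
# One more request in window 1 goes over the limit and is rejected.
fw_extra = fw_limiter.allow("client-a", now=61.5)

print(fw_first_burst, fw_second_burst, fw_extra)
```

Both bursts are accepted in full because each one lands in a different window, even though they are only two seconds apart.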

Trade-offs

Pros — simple to implement; O(1) memory per client; trivial to store in Redis with INCR and EXPIRE.
Cons — the boundary burst problem: a client can send 2× the allowed rate by timing requests at the window boundary. The limit is not accurately enforced at arbitrary points in time.

Sliding Window Log

Store a timestamped log of every request for each client (e.g. a Redis sorted set keyed by client ID, scored by timestamp). On each request, remove all entries older than the window duration, count the remaining entries, and reject if the count exceeds the limit.
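The steps above can be sketched in memory with a deque standing in for the Redis sorted set. The class name SlidingWindowLog and the explicit now argument are assumptions for illustration, and this variant records only allowed requests, which is one possible design choice rather than the only one.

```python
from collections import defaultdict, deque


class SlidingWindowLog:
    """Illustrative in-memory sliding window log (hypothetical helper)."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window_seconds = window_seconds
        # client_id -> timestamps of accepted requests, oldest first
        self.logs = defaultdict(deque)

    def allow(self, client_id: str, now: float) -> bool:
        log = self.logs[client_id]
        cutoff = now - self.window_seconds
        # Remove entries that have fallen out of the rolling window.
        while log and log[0] <= cutoff:
            log.popleft()
        if len(log) >= self.limit:
            return False
        log.append(now)
        return True


swl_limiter = SlidingWindowLog(limit=100, window_seconds=60)

# The same traffic that slips past a fixed window: 100 at :59, 100 at :01.
swl_first_burst = sum(swl_limiter.allow("client-a", now=59.0) for _ in range(100))
swl_second_burst = sum(swl_limiter.allow("client-a", now=61.0) for _ in range(100))
# Once the first burst is more than 60 seconds old, requests are accepted again.
swl_later = swl_limiter.allow("client-a", now=119.5)

print(swl_first_burst, swl_second_burst, swl_later)
```

Here the second burst is rejected entirely, because the log still holds all 100 timestamps from two seconds earlier.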

Trade-offs

Pros — perfectly accurate; no boundary burst problem. The window truly slides with each request.
Cons — high memory cost. A timestamp is stored for every request in the window, and implementations that also record rejected requests store even more. For a limit of 1,000 req/min per client, the log holds up to 1,000 entries per client when only allowed requests are recorded.

Sliding Window Counter

A hybrid of fixed window and sliding window log. Keep counters for the current and previous fixed windows. Estimate the count for the rolling window using a weighted combination:

count = prev_window_count × overlap_ratio + current_window_count

Here overlap_ratio is the fraction of the previous window that falls inside the current rolling window. For example, if the window is 1 minute and you are 40 seconds into the current window, the rolling window reaches back 20 seconds into the previous one, so the previous window contributes 20/60 ≈ 0.33 of its count.
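A minimal sketch of the weighted estimate follows, assuming a one-minute window and a limit of 100. The class name SlidingWindowCounter, the estimate helper, and the explicit now argument are hypothetical names chosen for illustration, and rejecting when the estimate plus the new request would exceed the limit is one reasonable convention among several.

```python
class SlidingWindowCounter:
    """Illustrative in-memory sliding window counter (hypothetical helper)."""

    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window_seconds = window_seconds
        # (client_id, window_index) -> requests counted in that fixed window
        self.counters = {}

    def estimate(self, client_id: str, now: float) -> float:
        window_index = int(now // self.window_seconds)
        elapsed = now - window_index * self.window_seconds
        # Fraction of the previous window still inside the rolling window.
        overlap_ratio = (self.window_seconds - elapsed) / self.window_seconds
        prev_count = self.counters.get((client_id, window_index - 1), 0)
        curr_count = self.counters.get((client_id, window_index), 0)
        return prev_count * overlap_ratio + curr_count

    def allow(self, client_id: str, now: float) -> bool:
        # Reject if counting this request would push the estimate over the limit.
        if self.estimate(client_id, now) + 1 > self.limit:
            return False
        key = (client_id, int(now // self.window_seconds))
        self.counters[key] = self.counters.get(key, 0) + 1
        return True


swc_limiter = SlidingWindowCounter(limit=100, window_seconds=60)

# 90 requests land in the first window (at second 30).
for _ in range(90):
    swc_limiter.allow("client-a", now=30.0)

# 40 seconds into the next window, the previous window is weighted by 20/60.
swc_estimate = swc_limiter.estimate("client-a", now=100.0)
# A burst at that moment is admitted only until the estimate reaches the limit.
swc_allowed = sum(swc_limiter.allow("client-a", now=100.0) for _ in range(100))

print(round(swc_estimate, 2), swc_allowed)
```

With 90 requests in the previous window, the estimate at that moment is about 30, so roughly 70 further requests fit before the limit of 100 is reached.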

Trade-offs

Pros — memory-efficient (only two counters per client); smooths out the boundary burst problem significantly; good approximation of a true sliding window.
Cons — the weighted estimate assumes requests were evenly distributed across the previous window, which is an approximation. Cloudflare's analysis of its own traffic reported that only 0.003% of requests were wrongly allowed or rate limited, which is accurate enough for almost all use cases.

Token Bucket

Each client has a bucket with a maximum capacity of N tokens. Tokens are added to the bucket at a fixed refill rate (e.g. 10 tokens per second) up to the maximum. Each request consumes one token. If the bucket is empty, the request is rejected.

How it works

A client with a full bucket of 100 tokens can send a burst of 100 requests instantly, then is limited to the refill rate going forward.
Tokens accumulate during idle periods, allowing clients to "save up" capacity for a future burst — up to the bucket maximum.
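Implementation: store the token count and the last refill timestamp. On each request, calculate how many tokens have been added since the last check and cap the total at the maximum. If at least one token is available, subtract one and allow the request; otherwise reject it. Persist the updated count and timestamp either way.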
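The implementation note above can be sketched as follows. The TokenBucket dataclass, its field names, and the explicit now argument are assumptions for illustration; a real limiter would keep one such record per client, typically in a shared store, and would need the update to be atomic.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Illustrative token bucket state for a single client (hypothetical helper)."""

    capacity: float      # maximum number of tokens the bucket can hold
    refill_rate: float   # tokens added per second
    tokens: float        # tokens currently available
    last_refill: float   # timestamp (seconds) of the last refill calculation

    def allow(self, now: float) -> bool:
        # Lazily add the tokens earned since the last check, capped at capacity.
        if now > self.last_refill:
            earned = (now - self.last_refill) * self.refill_rate
            self.tokens = min(self.capacity, self.tokens + earned)
            self.last_refill = now
        if self.tokens < 1.0:
            return False
        self.tokens -= 1.0
        return True


tb_bucket = TokenBucket(capacity=100, refill_rate=10, tokens=100, last_refill=0.0)

# A full bucket absorbs a burst of 100; the remaining 50 attempts are rejected.
tb_burst = sum(tb_bucket.allow(now=0.0) for _ in range(150))
# One second later only the refill (10 tokens) is available.
tb_after_one_second = sum(tb_bucket.allow(now=1.0) for _ in range(150))

print(tb_burst, tb_after_one_second)
```

Refilling lazily on each request avoids a background timer: the bucket only needs the stored count and timestamp to work out how many tokens it has earned.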

Trade-offs

Pros — allows controlled bursting, which suits APIs where clients naturally batch work. Smooth average rate with flexibility. Used by AWS and Stripe.
Cons — two parameters to tune (capacity and refill rate); bursts can still briefly stress downstream services if the bucket is large.

Leaky Bucket

Requests enter a queue (the "bucket") and are processed at a fixed outflow rate, like water leaking from a bucket at a constant drip. A new request that arrives when the queue is full is dropped. Unlike the token bucket, the leaky bucket enforces a strictly constant output rate regardless of input burst size.

How it works

Incoming requests are enqueued rather than processed immediately.
A processor drains the queue at a fixed rate (e.g. 100 requests/second), forwarding each to the backend.
If the queue is at capacity, new requests are rejected with a 429 Too Many Requests response.
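The three steps above can be sketched with a bounded queue and a drain function driven by an explicit clock. The class name LeakyBucket, the offer and drain methods, and the small capacity and rate used in the demo are assumptions for illustration; a real deployment would drain from a timer or worker loop and return 429 where offer reports False.

```python
from collections import deque


class LeakyBucket:
    """Illustrative leaky bucket queue (hypothetical helper)."""

    def __init__(self, capacity: int, leak_rate: float, start: float = 0.0):
        self.capacity = capacity    # maximum number of queued requests
        self.leak_rate = leak_rate  # requests released per second
        self.queue = deque()
        self.last_leak = start      # time up to which leaks are accounted for

    def offer(self, request) -> bool:
        # A full queue means the caller should answer 429 Too Many Requests.
        if len(self.queue) >= self.capacity:
            return False
        self.queue.append(request)
        return True

    def drain(self, now: float) -> list:
        # Number of whole requests the fixed outflow rate permits since last_leak.
        permitted = int((now - self.last_leak) * self.leak_rate)
        released = []
        if permitted > 0:
            for _ in range(min(permitted, len(self.queue))):
                released.append(self.queue.popleft())
            # Advance only by the time consumed by whole leaks.
            self.last_leak += permitted / self.leak_rate
        return released


lb_bucket = LeakyBucket(capacity=5, leak_rate=2.0)

# A burst of 8 requests at t=0: 5 are queued, 3 are rejected.
lb_accepted = [lb_bucket.offer(i) for i in range(8)]
# After one second, two requests have leaked out to the backend, in order.
lb_released_first = lb_bucket.drain(now=1.0)
# After another second, two more follow.
lb_released_second = lb_bucket.drain(now=2.0)

print(lb_accepted.count(True), lb_released_first, lb_released_second)
```

However large the incoming burst, the backend in this sketch receives at most two requests per second, and the excess is either waiting in the queue or rejected.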

Trade-offs

Pros — guarantees a smooth, constant output rate. Protects backend services from any burst — the backend sees only the configured rate, never more.
Cons — legitimate burst traffic is queued and delayed, not just counted. A sudden spike fills the queue, causing requests to wait or be dropped even if the long-term average is within the limit. Not ideal when clients expect low-latency responses.

Comparison

Algorithm	Burst allowed	Memory	Accuracy	Best for
Fixed Window	Yes (at boundary)	O(1)	Low — boundary burst	Simple limits where boundary bursts are acceptable
Sliding Window Log	No	O(requests in window)	Exact	Strict accuracy requirements, low request volume
Sliding Window Counter	No	O(1)	High approximation	Most production APIs — best balance of accuracy and efficiency
Token Bucket	Yes (up to capacity)	O(1)	High	APIs where clients benefit from bursting (AWS, Stripe)
Leaky Bucket	No	O(queue size)	Exact output rate	Protecting backends that require a constant, smooth request rate