Token bucket
What You Are Building
A token bucket rate limiter that supports per-key state, an injected clock, and concurrent Allow calls. By the end of this stage one process can rate-limit thousands of identities (users, IPs, API keys) from a single in-memory map without ever calling time.Sleep and without two goroutines, threads, or simultaneous promises ever combining to grant more than capacity tokens in a single instant.
Stage 1 is single-process and uses one algorithm. The Limiter interface that lets stage 2's sliding-window counter slot in next to this implementation arrives in stage 2. The distributed counter backend arrives in stage 3.
Why This Matters
Token bucket is the rate-limiting algorithm most production systems use. AWS API Gateway uses it. Stripe layers it underneath their sliding-window counter. Discord's per-route limits are token buckets. The reasons it wins:
- It supports bursts naturally. A bucket that has been idle for a while refills to capacity and absorbs a short flood without rejecting requests.
- It can compute an exact
Retry-Afteron denial. The wait is(1 - tokens) / rate. Sliding-window approximates; token bucket is precise. - The state per key is two numbers (
tokens,last_refill) plus a parameter pair (capacity,rate). At a million distinct keys it is a few megabytes of memory.
The naive implementations get one of three things wrong. They refill on a background ticker (which couples capacity to scheduler jitter and makes tests fragile). They store tokens as an integer (which loses the fractional refill credit between calls and skews the rate downward). Or they lock too coarsely (which makes a hot key the global bottleneck). This stage gets all three right.
The Interface You Implement
Stage 1 ships the token bucket type, the Decision shape, and an injected clock option. Every stage from here on returns the same Decision shape.
Go (ratelimit/ratelimit.go):
package ratelimit
import (
"context"
"time"
)
type Decision struct {
Allowed bool
RetryAfter time.Duration
Remaining int
}
type TokenBucket struct { /* unexported */ }
type Option func(*config)
// WithClock injects a clock function for tests. Defaults to time.Now.
func WithClock(now func() time.Time) Option
// NewTokenBucket returns a token bucket with the given capacity and refill rate.
func NewTokenBucket(capacity int, refillPerSec float64, opts ...Option) *TokenBucket
// Allow returns the decision for one unit-cost call against the given key.
func (tb *TokenBucket) Allow(ctx context.Context, key string) (Decision, error)Python (ratelimit/__init__.py):
from dataclasses import dataclass
from typing import Callable
@dataclass
class Decision:
allowed: bool
retry_after: float # seconds
remaining: int
class TokenBucket:
def __init__(
self,
capacity: int,
refill_per_sec: float,
*,
clock: Callable[[], float] | None = None,
) -> None: ...
def allow(self, key: str) -> Decision: ...clock returns seconds since an arbitrary epoch (monotonic in production, advanced manually in tests).
TypeScript (src/ratelimit.ts):
export interface Decision {
allowed: boolean
retryAfterMs: number
remaining: number
}
export interface LimiterOptions {
clock?: () => number // milliseconds since epoch
}
export class TokenBucket {
constructor(capacity: number, refillPerSec: number, opts?: LimiterOptions)
allow(key: string): Decision
}retryAfterMs is milliseconds; the Go and Python variants use seconds with sub-second precision. Same information, language-idiomatic units.
What the Tests Check
Five tests gate this stage. Each catches a specific failure mode.
FullBucketAllowsCapacity. Withcapacity = 10, the first ten consecutiveAllowcalls returnallowed=true. The eleventh returnsallowed=falsewithretry_after > 0. Catches a missing or off-by-one capacity check.DenialCarriesRetryAfter. Every denial returns a positiveretry_after. Catches an implementation that returnsfalsewithretry_after=0, which would push clients into a tight retry loop.RefillRestoresTokens. After capacity is drained, advancing the injected clock by1 / rateseconds restores exactly one token. The nextAllowsucceeds. Catches missing lazy refill and integer truncation of fractional tokens.PerKeyIsolated. Two distinct keys do not share state. Draining keyadoes not affect keyb. Catches an accidental global counter.ConcurrentAllowGrantsAtMostCapacity. 100 goroutines / threads / promises race to callAllowon the same key, against an empty bucket pre-filled withcapacity = 50. Exactly 50 of them getallowed=true, never more. Catches a missing atomic CAS, a missing mutex, or a check-then-act race.
Run mise run stage 1. The concurrent test runs under -race in Go and is informative-only in Python.
Guidance
The wrong implementation runs a background goroutine that adds tokens to every bucket at 1 / rate second intervals. Three problems:
- It scales with the number of distinct keys, not with the number of
Allowcalls. - The exact refill timing depends on scheduler jitter, which makes the
RefillRestoresTokenstest flaky. - The background goroutine must be carefully cleaned up or it leaks.
The right implementation does lazy refill on the Allow call:
elapsed = clock() - last_refill
new_tokens = elapsed * rate
tokens = min(capacity, tokens + new_tokens)
last_refill = clock()
No background work, no scheduler dependency, no leak.
If you store tokens as an integer, the lazy refill silently rounds down between calls. With rate = 10/sec and Allow calls 50 milliseconds apart, each refill adds 0.5 tokens, which truncates to zero. The effective rate drops to zero and the bucket never refills.
Store tokens as a float64 (Go), float (Python), or number (TypeScript). The Decision.remaining field can still be an integer; cast on the way out.
For concurrent Allow calls against the same key you need to make read-decide-write atomic. Two paths:
- Mutex per key. Simple. A
sync.Mutex(Go),threading.Lock(Python), or no-op (single-threaded JavaScript). Held only for the duration of the lazy refill plus the token deduction. Microseconds. Fine for most workloads. - Compare-and-swap on (tokens, last_refill) as a single word. Faster under high contention on a single key. More complex; only worth it if your profile shows the mutex is a bottleneck.
Karnstack tests do not care which you pick. Pick the simpler one.
In single-threaded JavaScript runtimes, Allow is naturally atomic since no other code can run between the read and the write. The ConcurrentAllowGrantsAtMostCapacity test in the TypeScript starter uses Promise.all over a hundred allow calls; the test still passes without a lock because the event loop serializes them.
If you ship this code in a worker pool or share state across processes with SharedArrayBuffer, you need real atomics. The stage-3 distributed counter backend addresses the multi-process case.
Common Pitfalls
-
The
RefillRestoresTokenstest fails by exactly one token. You storedtokensas an integer. Switch to floating point and only cast to integer inremaining. -
The
DenialCarriesRetryAftertest fails becauseretry_after = 0on denial. You computedretry_after = (capacity - tokens) / rateinstead of(1 - tokens) / rate. The wait is the time to refill enough for the next single call, not enough to fully top up. -
The
ConcurrentAllowGrantsAtMostCapacitytest fails with the count drifting above capacity (e.g. 53 out of 100). Classic check-then-act. Your code readstokens, checks>= 1, then writestokens - 1. Between the read and the write another goroutine did the same. Wrap the read-decide-write in a mutex or use a CAS loop. -
The
PerKeyIsolatedtest fails because all keys share state. You forgot the per-key map. The bucket holdsmap[string]*bucketState, not a singlebucketState. The map itself needs synchronization (a top-levelsync.RWMutexor async.Map), separate from the per-key mutex. -
mise run stage 1passes locally but flakes on CI. Your test usestime.Sleepinstead of advancing the injected clock. Replace the sleep withclock.Advance(d)on the test'sFakeClock.
Bonus
- Add an optional
costparameter toAllowso a non-unit operation (a multi-record write that should count as ten) can deduct ten tokens. Tests do not exercise this; production uses it heavily. - Replace the per-key map with an LRU bounded at
maxKeysto prevent unbounded growth when distinct identifiers churn. Stage 5 mentions this; you can land it here. - Benchmark
Allowagainst a hot key. A clean mutex implementation runs in well under one microsecond on a modern laptop. The atomic-CAS version is roughly twice as fast under heavy contention.
What You Built
A token bucket with per-key state, lazy refill, atomic decisioning, and a clean Decision return shape. The whole implementation is around a hundred lines. The next stage introduces a second algorithm, the sliding-window counter, and pulls both behind a common Limiter interface so the same test suite runs against either.
Continue to Stage 2: Sliding-Window Counter.