Stacktrace #07 · cloudflare/pingora

The 10ms Tick Under Pingora

Round every deadline to a 10ms bucket, share one timer across everyone in that bucket, sweep it all from one thread. 27x cheaper than tokio.


Cloudflare's nginx replacement burns through millions of timeouts a second, most of which never fire. Their fix: round every deadline to the next 10ms, share one timer across everyone waiting for that slot, and let a single background thread sweep the whole thing. Result: timeouts that cost 4 nanoseconds instead of 107.

A Quick Word On Pingora

If you haven't heard of it: pingora is the Rust HTTP proxy that Cloudflare wrote to replace nginx at their edge. They open-sourced it in 2024 after years of running it in production. It serves a meaningful slice of the internet, north of 40 million requests per second on a normal day, somewhere over a trillion requests every 24 hours.

Cloudflare's edge sits between you and most of the websites you visit. When you load a page, your browser is almost certainly talking to pingora first, and pingora is talking to the actual website on your behalf. So when we say it processes "a lot" of HTTP, we mean it. The numbers in this post aren't a microbenchmark trying to look good. They're real costs that show up in real bills when you multiply them by a few million per second per server.

What makes pingora interesting (and why we're reading its source) is that they didn't just rewrite nginx in Rust. They kept the parts of the design that worked and quietly threw out the parts that didn't scale to their workload. The timer is one of those parts. They saw nginx using a classic data structure, they saw tokio using a classic data structure, and they decided neither was cheap enough.

Every Connection Has a Timer

A proxy spends most of its time waiting. It waits for the client to finish sending. It waits for the upstream to respond. It waits for the body to drain. Each wait has a deadline: if nothing happens by then, give up.

In Rust async code that looks like:

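use std::time::Duration;

// Ok(value) if the read finishes first, Err(Elapsed) if the 5s deadline does.
let result = tokio::time::timeout(
  Duration::from_secs(5),
  read_from_socket(),
).await;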

That timeout call is doing two things at once. It's polling your real work (the read), and in parallel it's arming a stopwatch. Whichever finishes first wins.
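To make the "whichever finishes first wins" race concrete without pulling in an async runtime, here is a minimal sketch on the standard library alone. This is not how tokio implements timeout(): a thread and a channel stand in for the future and the stopwatch, and the helper name with_deadline is made up for illustration.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Hypothetical helper: run `work` on its own thread and wait at most `limit`.
/// Returns Some(value) if the work wins the race, None if the deadline wins.
fn with_deadline<T, F>(limit: Duration, work: F) -> Option<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // If the receiver already gave up, the send fails; ignore that.
        let _ = tx.send(work());
    });
    // recv_timeout plays the stopwatch: it returns Err once `limit` elapses.
    rx.recv_timeout(limit).ok()
}
```

Note what the sketch pays for on every call: a channel and a thread, whether or not the deadline ever matters. The async version is far cheaper, but the shape of the cost is the same.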

Now zoom out. Pingora runs at Cloudflare's edge. A single box is serving tens of thousands of concurrent connections, each one creating and tearing down little timers constantly: one for the read, one for the write, one for the keepalive, one for the upstream connect, one for the TLS handshake. Most of these timers are armed, the IO completes, and the timer is cancelled before it ever fires.

This is the worst possible workload for a stopwatch. You're creating it, you're never going to use it, and you're throwing it away. At a few hundred per second, fine. At millions per second, the stopwatch becomes the bottleneck.

What Tokio Does

Tokio's timer is good. It uses a hierarchical timer wheel, which is the textbook fast data structure for this.

A timer wheel is basically a clock with one bucket per tick. Every timer drops into the bucket for "when does it fire". The clock hand advances every tick and fires whatever's in the current bucket. That's the cheap part: amortized constant time to insert, constant time to fire one tick's worth. Hierarchical means it's actually several clocks stacked together (one for milliseconds, one for seconds, one for minutes, etc), so you can represent long deadlines without needing a million-bucket array.

Figure: A timer wheel. The hand advances one bucket per tick. Whatever's in the current bucket fires together.
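If the clock metaphor is hard to picture, a single-level wheel fits in a few lines. This is a teaching sketch, not tokio's implementation: the type name Wheel and its methods are ours, and it skips the hierarchical part entirely.

```rust
/// A single-level timer wheel: one bucket per tick, one hand.
/// Sketch only: real wheels stack several levels to cover long deadlines.
struct Wheel<T> {
    buckets: Vec<Vec<T>>,
    hand: usize, // index of the bucket the hand currently points at
}

impl<T> Wheel<T> {
    fn new(slots: usize) -> Self {
        Wheel {
            buckets: (0..slots).map(|_| Vec::new()).collect(),
            hand: 0,
        }
    }

    /// Schedule `item` to fire `ticks` ticks from now (1 <= ticks <= slots).
    /// Insertion is a single push, so it is amortized constant time.
    fn insert(&mut self, ticks: usize, item: T) {
        assert!(ticks >= 1 && ticks <= self.buckets.len(), "out of range for one level");
        let slot = (self.hand + ticks) % self.buckets.len();
        self.buckets[slot].push(item);
    }

    /// Advance the hand one bucket and return everything that was waiting there.
    fn tick(&mut self) -> Vec<T> {
        self.hand = (self.hand + 1) % self.buckets.len();
        std::mem::take(&mut self.buckets[self.hand])
    }
}
```

Anything further out than one lap of the hand doesn't fit, which is exactly the gap the stacked "seconds" and "minutes" wheels fill.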

It's a great structure. It scales. The Linux kernel uses the same idea for its own timers. (nginx, for comparison, keeps its timers in a red-black tree.) The problem isn't the structure. The problem is what happens around it.

Every timer is its own thing: a unique entry in the wheel, a unique waker registered, a unique cancellation path. You pay for that uniqueness whether or not you use it.

The pingora team measured it and put the numbers right in the source code comment:

pingora-timeout/src/timer.rs (rust)
//! Benchmark:
//! - create 7.809622ms total, 78ns avg per iteration
//! - drop:   1.348552ms total, 13ns avg per iteration
//!
//! tokio timer:
//! - create 34.317439ms total, 343ns avg per iteration
//! - drop:   10.694154ms total, 106ns avg per iteration

So just to make a timer that you probably won't use, tokio takes about 343 nanoseconds. Pingora takes 78. The end-to-end picture, including running the whole timeout() wrapper, is even more dramatic: 4ns per iteration vs 107ns. That's 27 times cheaper.
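To see why a hundred nanoseconds matters, multiply it out. The helper below is ours, and the rate of one million timeouts per second is an assumed round number for a busy server, not a figure from the pingora source.

```rust
/// CPU time spent on timeouts, in milliseconds per wall-clock second,
/// given a per-timeout cost in nanoseconds and a rate in timeouts per second.
fn cpu_ms_per_sec(ns_per_timeout: u64, timeouts_per_sec: u64) -> f64 {
    (ns_per_timeout * timeouts_per_sec) as f64 / 1_000_000.0
}
```

At that assumed rate, cpu_ms_per_sec(107, 1_000_000) is 107ms of CPU per second for the tokio wrapper and cpu_ms_per_sec(4, 1_000_000) is 4ms for pingora's. The ratio, 107 / 4 = 26.75, is where the 27x comes from.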

How? They cheated.

The Cheat: Round Everything to 10ms

The first move is mathematical. Take any deadline a caller asks for and snap it to the next 10ms boundary.

pingora-timeout/src/timer.rs (rust)
const RESOLUTION_MS: u64 = 10;
 
#[inline]
fn round_to(raw: u128, resolution: u128) -> u128 {
    raw - 1 + resolution - (raw - 1) % resolution
}

If you ask for a 53ms timeout, you get 60ms. If you ask for 101ms, you get 110ms. Ask for 1000ms, you get exactly 1000ms because it already lands on a boundary. The rounding always goes up, never down, so you never get a premature timeout.
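Walking the formula by hand for 53: raw - 1 is 52, 52 % 10 is 2, and 52 + 10 - 2 is 60. Here is the same function on its own so you can check the other cases. The body is copied from the excerpt; the caveat in the comment (raw must be at least 1) is our own observation, not something the source remarks on.

```rust
/// Round `raw` up to the next multiple of `resolution`.
/// Values already on a boundary stay where they are.
/// `raw` must be at least 1: at 0, `raw - 1` would underflow.
fn round_to(raw: u128, resolution: u128) -> u128 {
    raw - 1 + resolution - (raw - 1) % resolution
}
```

Subtracting 1 before rounding is what keeps 1000 at 1000 instead of bumping it to 1010.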

The deliberate sacrifice

Pingora is giving up sub-10ms precision on timeouts. For an HTTP proxy this is basically free. A 5-second read timeout doesn't care if it fires at 5.003s or 5.012s. Nobody noticed and nobody cares. The pingora team noticed there was a knob there, and they turned it down to make everything else faster.

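Now the key consequence. A request asking for "53ms from now" and another asking for "57ms from now" both get snapped to the same 60ms slot. (Strictly, it's the absolute deadline on pingora's internal clock that gets rounded, so any two callers whose deadlines fall in the same 10ms window end up together.) They have the same deadline. They could share one timer.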

Figure: four callers with deadlines at 23ms, 27ms, 29ms, and 41ms. Tokio, one timer per caller: 4 timers. Pingora, round up to the next 10ms and share the slot: 23ms, 27ms, and 29ms share one timer at 30ms, and 41ms lands on 50ms, so 2 timers serve 4 callers.
Tokio gives every caller their own timer. Pingora rounds everyone up to the next 10ms slot and shares one timer per slot.
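Counting it out for four callers with deadlines at 23ms, 27ms, 29ms, and 41ms: a small sketch that groups deadlines by slot. The names slot_for and group_by_slot are ours, and slot_for uses plain ceiling division, which gives the same answers as pingora's round_to for any deadline of at least 1ms.

```rust
use std::collections::BTreeMap;

const RESOLUTION_MS: u64 = 10;

/// Round a deadline (in ms) up to the next slot boundary.
fn slot_for(deadline_ms: u64) -> u64 {
    ((deadline_ms + RESOLUTION_MS - 1) / RESOLUTION_MS) * RESOLUTION_MS
}

/// Group deadlines by slot and count how many callers land in each.
fn group_by_slot(deadlines_ms: &[u64]) -> BTreeMap<u64, usize> {
    let mut slots = BTreeMap::new();
    for &d in deadlines_ms {
        *slots.entry(slot_for(d)).or_insert(0) += 1;
    }
    slots
}
```

For [23, 27, 29, 41] the map has two keys: 30 with three callers and 50 with one. The number of keys is the number of timers you actually need.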

How Two Callers Become One Timer

The sharing is the second cheat. When you ask for a timer at a given deadline, pingora first looks to see whether someone else already wanted that exact same bucket. If they did, you just subscribe to their timer.

pingora-timeout/src/timer.rs (excerpt, rust)
pub fn register_timer(&self, duration: Duration) -> TimerStub {
    let now: Time = (Instant::now() + duration - self.zero).into();
    {
        let timers = self.timers.get_or(...).read();
        if let Some(t) = timers.get(&now) {
            return t.subscribe();  // someone else already armed this slot
        }
    }
    let timer = Timer::new();
    // ... insert into the local tree ...
    timer.subscribe()
}

The internal Timer is basically a notifier with a single "did it fire?" boolean. When the timer eventually fires, every subscriber's future wakes up at the same time. A thousand callers waiting on the same 10ms slot? One Timer, one entry in the tree, one fire() call.

Reading the Rust here

subscribe() hands out a cheap handle (an Arc, basically a shared pointer) to the same underlying notifier. Cloning a handle is just bumping a reference count. There's no copying of any actual timer state.
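Here is roughly what "a notifier plus a boolean, shared through an Arc" looks like. To keep it runnable on the standard library alone, this sketch swaps the async notifier for a Mutex<bool> and a Condvar, so subscribers block a thread instead of awaiting. The method names wait and has_fired are ours; treat the whole thing as an analogy for the shape, not a copy of pingora's types.

```rust
use std::sync::{Arc, Condvar, Mutex};

/// Stand-in for the shared timer: a "did it fire?" flag plus a notifier.
/// Assumption: Mutex<bool> + Condvar replace the async notifier.
struct Timer(Arc<(Mutex<bool>, Condvar)>);

/// The cheap handle each caller holds.
struct TimerStub(Arc<(Mutex<bool>, Condvar)>);

impl Timer {
    fn new() -> Self {
        Timer(Arc::new((Mutex::new(false), Condvar::new())))
    }

    /// Cloning the Arc only bumps a reference count; no timer state is copied.
    fn subscribe(&self) -> TimerStub {
        TimerStub(Arc::clone(&self.0))
    }

    /// Set the flag once and wake every subscriber.
    fn fire(&self) {
        let (fired, cv) = &*self.0;
        *fired.lock().unwrap() = true;
        cv.notify_all();
    }
}

impl TimerStub {
    /// Block until the shared timer fires. Returns at once if it already has.
    fn wait(&self) {
        let (fired, cv) = &*self.0;
        let mut done = fired.lock().unwrap();
        while !*done {
            done = cv.wait(done).unwrap();
        }
    }

    fn has_fired(&self) -> bool {
        let (fired, _) = &*self.0;
        *fired.lock().unwrap()
    }
}
```

The boolean matters as much as the notifier: a subscriber that shows up after fire() sees the flag and returns immediately instead of waiting for a wakeup that already happened.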

The Single Sweeping Thread

So who actually fires these timers? One thread. One.

pingora-timeout/src/timer.rs (the clock thread, rust)
pub(crate) fn clock_thread(&self) {
    loop {
        std::thread::sleep(RESOLUTION_DURATION);  // 10ms
        let now = (Instant::now() - self.zero).as_millis();
 
        for thread_timer in self.timers.iter() {
            let mut timers = thread_timer.write();
            // pop every entry whose deadline is <= now, fire it
            loop {
                let due = timers.iter().next()
                    .and_then(|(k, _)| if k.not_after(now) { Some(*k) } else { None });
                if let Some(k) = due {
                    timers.remove(&k).unwrap().fire();
                } else {
                    break;
                }
            }
        }
    }
}

The loop is dumb on purpose. Sleep for 10ms. Wake up. Walk every worker thread's sorted tree of pending timers. Fire anything whose deadline has passed. Go back to sleep.
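The inner loop is worth isolating, because it leans on one property of a sorted map: the first key is always the earliest deadline. A sketch with plain u64 milliseconds as keys (pingora uses its own Time type) and a made-up function name, drain_due:

```rust
use std::collections::BTreeMap;

/// Remove and return every entry whose deadline is <= `now`, earliest first.
/// A BTreeMap iterates in key order, so we can stop at the first key that is
/// still in the future without looking at anything behind it.
fn drain_due<V>(timers: &mut BTreeMap<u64, V>, now: u64) -> Vec<V> {
    let mut due = Vec::new();
    loop {
        let next = timers.keys().next().copied();
        match next {
            Some(deadline) if deadline <= now => {
                if let Some(v) = timers.remove(&deadline) {
                    due.push(v);
                }
            }
            _ => break, // tree is empty, or the earliest deadline is not due yet
        }
    }
    due
}
```

On a tick where nothing is due, the cost per tree is a single peek at the first key.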

Each worker thread has its own sorted tree of timers (a BTreeMap, which is just a sorted dictionary indexed by deadline). That's the third cheat.

Figure: four workers, each with its own BTreeMap<Time, Timer> mapping deadlines to waiters (worker 1, for example, holds 30ms ×3, 50ms ×1, 80ms ×5, 210ms ×1), and one clock thread: sleep(10ms) → walk every tree → fire what's due → repeat.
Every worker owns its own sorted tree of pending timers. One clock thread sweeps all of them, 10ms at a time. Inserts never contend.

The whole point of giving each worker its own tree is that inserts never collide. A worker thread inserting a new timer takes a write lock on its own tree only. No other worker can contend with it. The only thread that ever touches another worker's tree is the clock thread, and only once every 10ms.

A comment in the source spells it out:

Usually we check if another thread has inserted the same node before we get the write lock, but because only this thread will insert anything to its local timers tree, there is no possible race that can happen.

This is a small piece of code with a huge implication. The hot path (a worker adding a timer) never waits on another worker. The clock path (firing timers) locks each worker's tree for the few microseconds it takes to drain the due entries, and that brief window is the only contention a worker ever sees. That's the entire concurrency story.
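The layout is easy to mock up. In this sketch a plain Vec indexed by worker id stands in for per-thread storage, each tree maps a deadline to a count of waiters instead of a real timer, and every name (make_trees, register, sweep, run_workers) is hypothetical. What it keeps from the original is the locking discipline: a worker writes only to its own tree, and the sweeper is the only visitor.

```rust
use std::collections::BTreeMap;
use std::sync::{Arc, RwLock};
use std::thread;

type Tree = BTreeMap<u64, usize>; // deadline (ms) -> number of waiters

/// One lock-protected tree per worker, all visible to the sweeper.
fn make_trees(workers: usize) -> Arc<Vec<RwLock<Tree>>> {
    Arc::new((0..workers).map(|_| RwLock::new(Tree::new())).collect())
}

/// Worker side: touch only your own tree.
fn register(trees: &[RwLock<Tree>], worker: usize, deadline_ms: u64) {
    let mut mine = trees[worker].write().unwrap();
    *mine.entry(deadline_ms).or_insert(0) += 1;
}

/// Sweeper side: visit every tree, holding each lock only while draining it.
/// Returns how many waiters were released.
fn sweep(trees: &[RwLock<Tree>], now_ms: u64) -> usize {
    let mut released = 0;
    for tree in trees {
        let mut t = tree.write().unwrap();
        let later = t.split_off(&(now_ms + 1)); // keys > now stay pending
        released += t.values().sum::<usize>();
        *t = later;
    }
    released
}

/// Spawn one thread per tree; each registers `per_worker` timers in its own
/// tree, cycling through deadlines of 10, 20, 30, 40, and 50ms.
fn run_workers(trees: &Arc<Vec<RwLock<Tree>>>, per_worker: u64) {
    let handles: Vec<_> = (0..trees.len())
        .map(|w| {
            let trees = Arc::clone(trees);
            thread::spawn(move || {
                for i in 0..per_worker {
                    register(&trees, w, 10 * (i % 5 + 1));
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
}
```

With four workers registering 100 timers each, a sweep at 30ms releases 240 waiters (three deadlines, 20 waiters apiece, four trees) and leaves the 40ms and 50ms entries for a later tick.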

Self-Healing

A nice detail: the clock thread isn't started eagerly. The first call to timeout() checks whether a clock thread is alive, and if not, spawns one. If the thread ever dies (panics, gets stuck), the next caller will notice via a watchdog and respawn it.

pingora-timeout/src/timer.rs (rust)
const DELAYS_SEC: i64 = 2;
 
pub(crate) fn is_clock_running(&self) -> Result<(), i64> {
    let now = Instant::now().duration_since(self.zero).as_secs() as i64;
    let prev = self.clock_watchdog.load(Ordering::SeqCst);
    if now < prev + DELAYS_SEC { Ok(()) } else { Err(prev) }
}

The clock thread writes a fresh timestamp into an atomic every tick. If 2 seconds pass without an update, the system assumes the thread is dead. Anyone trying to register a new timer will trip the check and start a fresh thread.
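The watchdog is small enough to rebuild in isolation. In this sketch the Watchdog struct and the beat and check names are ours, time is passed in explicitly so the behavior is easy to test, and starting the stamp at -DELAYS_SEC (so the very first check fails and triggers a spawn) is an assumption about how you'd want it initialized.

```rust
use std::sync::atomic::{AtomicI64, Ordering};

const DELAYS_SEC: i64 = 2;

/// Hypothetical watchdog: the clock thread stamps the current time (seconds)
/// on every tick, and callers compare that stamp against their own clock.
struct Watchdog {
    last_beat: AtomicI64,
}

impl Watchdog {
    fn new() -> Self {
        // Assumption: start stale so the first check reports "not running".
        Watchdog { last_beat: AtomicI64::new(-DELAYS_SEC) }
    }

    /// Called by the clock thread on each tick.
    fn beat(&self, now_sec: i64) {
        self.last_beat.store(now_sec, Ordering::SeqCst);
    }

    /// Ok if the last beat is less than DELAYS_SEC old, Err(last beat) otherwise.
    fn check(&self, now_sec: i64) -> Result<(), i64> {
        let prev = self.last_beat.load(Ordering::SeqCst);
        if now_sec < prev + DELAYS_SEC { Ok(()) } else { Err(prev) }
    }
}
```

A caller that gets Err knows the thread is missing or wedged and can spawn a new one; the Err value tells it which stale stamp it observed.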

You don't have to think about it. You don't have to babysit it. The first timeout that gets created kicks the whole machine into life, and the machine keeps itself alive.

The Aside: Surviving fork()

There's one more cute thing in this file, and it's the kind of detail you only put in production code after you've been bitten.

pingora-timeout/src/timer.rs (rust)
/// Pause the timer for fork()
///
/// Because RwLock across fork() is undefined behavior, this function makes sure
/// that no one holds any locks.
pub fn pause_for_fork(&self) {
    self.paused.store(true, Ordering::SeqCst);
    std::thread::sleep(RESOLUTION_DURATION * 2);
}

If a worker is in the middle of holding a lock on its tree and you fork() the process, the child gets a snapshot of the lock in a held state but no thread holding it. That's a classic Unix footgun. Pingora's answer: before forking, set a flag that makes the clock thread stop touching locks, then sleep long enough that any in-flight register_timer call has finished. Two ticks, then fork. The Old World did graceful binary upgrades and so does pingora.
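The excerpt shows the flag being set but not the other half: the sweeper checking it. A sketch of what that side could look like, under the assumption that the check happens before any lock is taken. The function sweep_once and its signature are hypothetical.

```rust
use std::collections::BTreeMap;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::RwLock;

/// One pass of a hypothetical sweeper that honors a pause flag.
/// Returns None if paused (no lock was touched), otherwise the number of
/// entries that were due and removed.
fn sweep_once(
    paused: &AtomicBool,
    timers: &RwLock<BTreeMap<u64, ()>>,
    now: u64,
) -> Option<usize> {
    if paused.load(Ordering::SeqCst) {
        return None; // skip the tick entirely: never reach for the lock
    }
    let mut t = timers.write().unwrap();
    let later = t.split_off(&(now + 1)); // keys > now stay pending
    let fired = t.len();
    *t = later;
    Some(fired)
}
```

Sleeping for two ticks after setting the flag gives a sweep that was already in progress time to finish and release its lock before the fork happens.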

When Not To Use It

The pingora authors put this in the docstring, and I love them for it:

the benefits of this don't outweigh the overhead unless there are more than about 100 timeout() calls/sec in the system. Use regular tokio timeout in the low usage case.

That is what good infrastructure code looks like. Not "ours is always better." Not "rip out tokio." They built a faster timer because they had a workload that needed it, they measured the crossover point, and they told you not to use it below that point. The 10ms tick isn't a universal win. It's a deliberate trade priced for one specific workload: a proxy taking the bullets at the edge of a global network.

What To Steal

The pattern, not the code. Three moves stacked together:

  1. Quantize. Find the precision your caller secretly doesn't need, and round to it. The savings cascade.
  2. Share by deadline. Once you've quantized, the same deadline gets you the same handle. N callers, 1 timer, 1 wakeup.
  3. Push to a single sweeper. Don't let every timer cancel itself. Let one thread come around on a tick and clean up.

Apply that to anything that's "create something cheap, throw it away" at high volume. Rate limiters. Stats counters. Sampling decisions. Bloom filter rotations. Anywhere you'd reach for a stopwatch and a hash map, ask whether you really need a stopwatch per caller.


Source: pingora-timeout/src/timer.rs and pingora-timeout/src/fast_timeout.rs
