Stacktrace #06: tigerbeetle/tigerbeetle

TigerBeetle Would Rather Crash Than Guess

Two asserts in a financial database. One dies if the clock regresses. One dies if you allocate after startup. Same philosophy.

7 min read

TigerBeetle is a database for moving money. I went digging through its source expecting clever tricks. What I found was stranger: a database engineered to die on contact with ambiguity.

What Even Is TigerBeetle?

If you haven't heard of it: TigerBeetle is an open-source financial database. Not a general-purpose one. It does one thing - double-entry accounting transfers - and it does that thing about a thousand times faster than Postgres can.

It's written in Zig. Zero dependencies. The whole binary is around 4 MB.

The team behind it is famously paranoid. They wrote their own consensus protocol, their own LSM storage engine, their own deterministic simulator that runs years of simulated network partitions in minutes. Their public engineering tone is somewhere between "NASA" and "you think this is a joke?".

So when I cloned the repo to look for something to write about, I expected to find the simulator, or some exotic LSM trick. Instead I kept tripping over a much simpler pattern. Two files. Two asserts. Same idea.

Let me show you.

Assert #1: The Clock That Refuses To Lie

Open src/time.zig. Scroll to line 44. This is what's there:

pub const TimeOS = struct {
    /// Hardware and/or software bugs can mean that the monotonic clock may regress.
    /// One example (of many): https://bugzilla.redhat.com/show_bug.cgi?id=448449
    /// We crash the process for safety if this ever happens, to protect against infinite loops.
    /// It's better to crash and come back with a valid monotonic clock than get stuck forever.
    monotonic_guard: u64 = 0,

That field, monotonic_guard, is just a u64. Initially zero. Every time anyone in TigerBeetle asks "what time is it on this machine?", it goes through this function:

fn monotonic(context: *anyopaque) u64 {
    const self: *TimeOS = @ptrCast(@alignCast(context));
 
    const m = blk: {
        if (is_windows) break :blk monotonic_windows();
        if (is_darwin) break :blk monotonic_darwin();
        if (is_linux) break :blk monotonic_linux();
        @compileError("unsupported OS");
    };
 
    // "Oops!...I Did It Again"
    if (m < self.monotonic_guard) @panic("a hardware/kernel bug regressed the monotonic clock");
    self.monotonic_guard = m;
    return m;
}

Read that again. It reads the OS clock. It checks: did the new reading come in before the previous one? If yes, the entire process panics.

The Britney Spears reference ("Oops!...I Did It Again") is a comment on a single line of production code in a database that holds people's money. I love it.
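
If you want to feel the pattern without booting a database, here's the guard re-created in isolation - not TigerBeetle's code, just the same shape, with readings passed in by hand so the panic path is easy to trigger:

var monotonic_guard: u64 = 0;

fn monotonic_checked(reading: u64) u64 {
    // Same check as TigerBeetle's: a reading earlier than the last one
    // means the "monotonic" clock lied, and the process dies.
    if (reading < monotonic_guard) @panic("monotonic clock regressed");
    monotonic_guard = reading;
    return reading;
}

pub fn main() void {
    _ = monotonic_checked(100);
    _ = monotonic_checked(150);
    _ = monotonic_checked(120); // panics: 120 < 150
}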

Wait, The Clock Can Go Backwards?

This is the part that catches people the first time they see it. Yes. The monotonic clock - the one that's supposed to only ever move forward - can go backwards.

It happens because of:

  - CPU TSC drift on multi-socket machines: different cores read slightly different times.
  - Buggy hypervisors: VM migrations leak time across hosts.
  - Kernel scheduler bugs: the exact class TigerBeetle links to (bugzilla 448449).
  - Cheap virtualization: containers and VMs can drift on every wakeup.

Most software ignores this. The standard reasoning is: "monotonic means monotonic, the kernel guarantees it, if it lies that's a kernel bug, not my problem".

TigerBeetle's reasoning is the opposite: "the kernel will lie eventually. When it does, the rest of our code assumes time only moves forward. Loops will spin. Timeouts will misfire. Replicas will disagree about ordering. It's better to die now than to keep running on a clock we no longer trust."

Crash early, restart clean, let the supervisor get you back to a known-good state. That's the bet.
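
In practice, "let the supervisor get you back" can be as boring as a restart policy. Here's a sketch of what that might look like under systemd - the unit is illustrative, not from TigerBeetle's docs, and the start command is from memory of their quick-start, so flags may differ by version:

[Service]
ExecStart=/usr/local/bin/tigerbeetle start --addresses=3001 /var/lib/tigerbeetle/0_0.tigerbeetle
Restart=always
RestartSec=1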

Assert #2: The Allocator That Locks Its Own Door

Now flip over to src/static_allocator.zig. Different file. Different mechanism. Same energy.

The header comment tells you everything:

//! An allocator wrapper which can be disabled at runtime.
//! We use this for allocating at startup and then
//! disable it to prevent accidental dynamic allocation at runtime.

A normal allocator is happy to give you memory whenever you ask. Forever. TigerBeetle wraps its allocator in a tiny state machine:

const State = enum {
    /// Allow `alloc` and `resize`.
    init,
    /// Don't allow any calls.
    static,
    /// Allow `free` but not `alloc` and `resize`.
    deinit,
};

Three states. The transitions are explicit:

pub fn transition_from_init_to_static(self: *StaticAllocator) void {
    assert(self.state == .init);
    self.state = .static;
}
 
pub fn transition_from_static_to_deinit(self: *StaticAllocator) void {
    assert(self.state == .static);
    self.state = .deinit;
}

And here's the punchline. Every single alloc call goes through this:

fn alloc(ctx: *anyopaque, len: usize, ptr_align: Alignment, ret_addr: usize) ?[*]u8 {
    const self: *StaticAllocator = @ptrCast(@alignCast(ctx));
    assert(self.state == .init);
    return self.parent_allocator.rawAlloc(len, ptr_align, ret_addr);
}

If the database is past startup and someone tries to allocate even one byte, the assert fails. The process dies.

Why Would Anyone Do This?

Because TigerBeetle is a real-time financial system, and real-time systems have one rule the rest of us ignore: if the latency of an operation depends on the heap, you don't actually have a real-time system.

Watch what happens in a typical service when you allocate on the hot path:

  1. Heap fragmentation grows over hours.
  2. Eventually some malloc triggers an mmap to grow the heap.
  3. That request takes 10x longer than every other request that day.
  4. A timeout somewhere fires. A retry kicks in. A queue backs up.
  5. At 3am you're paged because p99 spiked and nobody can tell you why.

TigerBeetle refuses to play that game. Every buffer, every data structure, every scratch space is allocated during startup. After transition_from_init_to_static() runs, the heap shape is frozen. Steady-state behavior is fully predictable. There is no "what if we get unlucky with malloc" because malloc is no longer reachable.
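
In code, that lifecycle looks something like this. A sketch, not a quote: I'm assuming an init/allocator() surface in the style of Zig's standard allocator wrappers, and config.journal_size_max is a made-up name:

var static = StaticAllocator.init(std.heap.page_allocator);
const allocator = static.allocator();

// Startup: every long-lived buffer is sized and allocated up front.
const journal = try allocator.alloc(u8, config.journal_size_max);

// Lock the door. From here on, the heap's shape is frozen.
static.transition_from_init_to_static();

// Anything like this now dies on the assert:
// _ = try allocator.alloc(u8, 1); // panic: self.state == .init fails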

Free still has a few legitimate uses during shutdown (cleanup paths, errdefer chains), so it stays allowed - but the moment it's called, it quietly flips the state to .deinit. The comment in the source file is one line:

// Once you start freeing, you don't stop.
self.state = .deinit;

Great line. There's no going back from teardown.
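
For context, the hook around that line looks roughly like this - paraphrased, and I'm reconstructing the assert from memory, so the exact wording may differ:

fn free(ctx: *anyopaque, buf: []u8, ptr_align: Alignment, ret_addr: usize) void {
    const self: *StaticAllocator = @ptrCast(@alignCast(ctx));
    // Freeing is legal during startup and teardown, never in between.
    assert(self.state == .init or self.state == .deinit);
    // Once you start freeing, you don't stop.
    self.state = .deinit;
    return self.parent_allocator.rawFree(buf, ptr_align, ret_addr);
}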

Same Pattern, Different File

Look at the two asserts side by side and the philosophy snaps into focus:

// time.zig
if (m < self.monotonic_guard) @panic("a hardware/kernel bug regressed the monotonic clock");
 
// static_allocator.zig
assert(self.state == .init);

Different problems. Same answer.

In both cases, the "nice" thing to do is to keep running. Coerce the bad clock value. Allocate the memory. Log a warning. Move on.

In both cases, TigerBeetle picks the opposite. It treats every assert as a statement of fact about how the system is allowed to behave - not a safety net, but an invariant. If reality stops matching the invariant, reality is the bug, and the only honest response is to stop.

This is why TigerBeetle code reads the way it does. It is saturated with asserts. The codebase has somewhere north of 6,000 of them. Every function declares what it expects of the world before it does any work, and refuses to continue if those expectations don't hold. The simulator's whole job is to find a sequence of inputs that makes any assert anywhere fire.
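
The house style is easy to imitate. Here's a sketch of mine - not from the repo - of what a function looks like when preconditions and postconditions are stated as asserts:

const assert = @import("std").debug.assert;

// Preconditions first, then the work, then postconditions. A violated
// assumption dies here, not three calls later in someone else's stack frame.
fn account_debit(balance: u64, amount: u64) u64 {
    assert(amount > 0); // zero-amount transfers were rejected upstream
    assert(amount <= balance); // overdrafts were rejected upstream

    const balance_new = balance - amount;

    assert(balance_new < balance); // a debit always decreases the balance
    return balance_new;
}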

The clock guard and the allocator state machine are just two of the most quotable examples. They're not defensive programming. They're load-bearing.

What I Learned

Asserts are not safety nets

Most codebases sprinkle asserts in as a "just in case" backstop. TigerBeetle treats them as the primary specification of correctness. The code is the spec. If the code lies, it dies.

Takeaway: If your assert can be silently caught and recovered from, it isn't an assert. It's a log line wearing a costume.

Crash-only is a feature, not a failure mode

Modern infra (Kubernetes, systemd, supervisors) makes process restarts cheap. TigerBeetle leans hard into this: the safest action when reality contradicts your model is to die fast and let the supervisor bring you back to a known-good state.

Takeaway: "What happens if this assumption breaks?" is a stronger question than "How do we recover gracefully?" You don't recover from corruption. You restart from a checkpoint.

A constraint is also an architecture

Disallowing alloc after startup sounds like a footnote. In practice it shapes every data structure in the codebase. There are no growable arrays on the hot path. There are no hash maps that resize themselves. Every buffer is sized at boot from configuration, and the system either fits or refuses to start.
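
Here's a sketch of the data-structure shape this rule forces - illustrative, not from the repo:

const std = @import("std");

const Transfer = struct { amount: u64 };

// Capacity comes from configuration, once, at startup. There is no grow().
const TransferQueue = struct {
    buffer: []Transfer,
    count: usize = 0,

    fn init(allocator: std.mem.Allocator, capacity: usize) !TransferQueue {
        return .{ .buffer = try allocator.alloc(Transfer, capacity) };
    }

    fn push(self: *TransferQueue, transfer: Transfer) void {
        // A full queue means it was mis-sized at boot. That's a bug to fix,
        // not a reason to call malloc on the hot path.
        std.debug.assert(self.count < self.buffer.len);
        self.buffer[self.count] = transfer;
        self.count += 1;
    }
};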

Takeaway: A small, harshly enforced rule will rewrite a codebase more thoroughly than any style guide.

One More Thing

The Britney Spears comment is not the only joke in the codebase. The static allocator's "once you start freeing, you don't stop" reads like a bouncer's house rule. There's a function literally named transition_from_init_to_static. The simulator is called the VOPR.

Underneath all of it is a team that takes correctness deadly seriously and refuses to take themselves seriously while doing it. That combination is rarer than it should be, and it's why TigerBeetle is worth reading even if you'll never run a financial database.

Two asserts. One philosophy. The whole thing fits in 100 lines of Zig.


What's the most paranoid assert you've found in production code? I'd love to hear about it.
