Stacktrace #05 · openclaw/openclaw

The AI Brain That Remembers, Sees, and Fails Over

Inside OpenClaw's codebase: retry loops, 6-layer tool permissions, hybrid memory with temporal decay, and cross-provider model failover.

10 min read

Everyone's shipping AI agents right now. Most of them break the moment an API goes down. I spent a week reading OpenClaw's codebase - with an AI doing the heavy lifting - to understand why it doesn't. Here's what we found.

Why This Matters Right Now

The personal AI agent market is on fire. Gartner predicts 40% of enterprise apps will embed task-specific agents by the end of this year, up from less than 5% in 2025. The U.S. market alone is projected at $46.3 billion by 2033 (Grand View Research). Everyone from Meta (which acquired Manus AI for ~$2B in early 2026) to solo developers spinning up LangGraph pipelines wants a piece of it.

But here's the dirty secret: production agents are far less reliable than benchmarks suggest. ReliabilityBench (Jan 2026) found that even modest task perturbations drop agent success rates from 96.9% to 88.1% - and that's before you introduce real-world faults like rate limits, timeouts, and partial API responses. Most agents are demos that fall apart in production. You've probably felt this - you ask your AI assistant something, it confidently starts a task, an API hiccups somewhere, and suddenly you're staring at a stack trace and a half-written email.

OpenClaw is an open-source personal AI agent that runs on your own devices. Not a hosted service - a local gateway that connects to your messaging channels and routes everything through an AI brain. It's had its share of security drama (CVE-2026-25253 affected 42K+ instances, and the "ClawJacked" vulnerability let any website hijack your agent with zero interaction). But the architecture underneath? That's genuinely interesting.

The security team patched both within 24 hours. But the real story isn't the vulnerabilities - it's what the codebase reveals about building agents that actually stay up.

I'll be honest: I didn't read 1,600 lines of TypeScript by hand. I pointed an AI at the codebase and asked it to find the interesting bits. Then I verified everything myself. It's a good workflow - the AI finds the needles, you check if they're actually needles. What follows is what survived that filter.

The Big Picture

Here's the mental model:

| Layer | What it does |
| --- | --- |
| Channels | WhatsApp, Telegram, Discord, Slack, Signal, iMessage, Matrix, MS Teams, IRC, and 15 more - each one is a plugin |
| Gateway | A WebSocket control plane that receives messages, routes them, manages sessions |
| Agent | The brain - assembles context, picks tools, calls the model, streams the response back |

The gateway is cool (36 channel plugins, 7-tier message routing, Proxy-based lazy loading), but the brain is where the real complexity lives. That's what we're digging into.

The Agent Run: Retry Loops All the Way Down

When a message arrives, here's what kicks off inside runEmbeddedPiAgent:

return enqueueSession(() =>
  enqueueGlobal(async () => {
    // 1. Resolve workspace, model, auth
    // 2. Load context engine
    // 3. Enter the attempt loop
  }),
);

Session and global lanes serialize work. No two messages for the same session run at the same time. No two global tasks overlap. It's a queue, not a thread pool.

This matters more than it sounds. Most agent frameworks just throw async calls at the wall. OpenClaw treats concurrency like a database treats transactions - ordered and safe.
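The lane pattern itself is small enough to sketch. Here's a minimal promise-chain lane under assumed names (`makeLane`, `enqueue` - not OpenClaw's actual API): each lane keeps the tail of a promise chain, and every enqueue appends to it, so tasks run strictly one at a time.

```typescript
type Task<T> = () => Promise<T>;

// A "lane" is just a serializing queue built from promise chaining.
function makeLane() {
  let tail: Promise<unknown> = Promise.resolve();
  return function enqueue<T>(task: Task<T>): Promise<T> {
    // Chain onto the tail whether the previous task fulfilled or rejected,
    // so one failed task doesn't wedge the lane forever.
    const run = tail.then(task, task);
    tail = run.catch(() => undefined);
    return run;
  };
}
```

Nesting one lane's enqueue inside another's, as in `enqueueSession(() => enqueueGlobal(...))`, gives you both orderings at once: per-session serialization and a global ceiling.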

Inside the attempt loop:

  1. Load skills - Markdown files that teach the agent how to use CLIs and workflows (the agent literally learns from prose)
  2. Compose tools - 30+ tools, filtered through 6+ policy layers (more on this below)
  3. Build the system prompt - skills, memory recall, authorized senders, workspace info, current time
  4. Stream the response - tool calls and text, back to whichever channel the message came from
  5. On error - and this is where it gets good

Errors don't crash the run. They redirect it:

  • Auth failure? Rotate to the next API key (advanceAuthProfile)
  • Context overflow? Compact the conversation (summarize old messages, keep recent ones)
  • Model down? Throw FailoverError, let the outer loop try a different provider entirely

The whole thing is an attempt loop (retry with the same model, rotate keys, compact context) nested inside a failover loop (switch models when the provider is dead). The agent keeps swinging until something works.
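The nested control flow can be sketched like this. Names and shapes are illustrative (the real attempt loop also rotates auth keys and compacts context between tries), but the skeleton - inner retries per model, outer walk over a fallback list - matches the description above:

```typescript
// Thrown when a provider is unreachable: skip retries, move to the next model.
class FailoverError extends Error {}

async function runWithFallback(
  models: string[],
  attempt: (model: string, tryNo: number) => Promise<string>,
  maxAttempts = 3,
): Promise<string> {
  for (const model of models) {
    for (let tryNo = 0; tryNo < maxAttempts; tryNo++) {
      try {
        return await attempt(model, tryNo);
      } catch (err) {
        if (err instanceof FailoverError) break; // provider dead: next model
        // otherwise: rotate key / compact context, then retry the same model
      }
    }
  }
  throw new Error("all models exhausted");
}
```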

This is exactly what Sierra calls a "Congestion-Aware Provider Selector" in their production systems, and what TrueFoundry productized as TrueFailover. OpenClaw arrived at the same architecture from the open-source side.

The Tool Pipeline: Same Brain, Different Permissions

This is the part that made me stop scrolling.

The AI has access to 30+ tools: file I/O, shell execution, browser control (Playwright), interactive HTML canvases, web search, cron scheduling, messaging, memory search, image generation, PDF handling, TTS, and subagent spawning.

The obvious question: who decides which tools the AI actually gets?

Most agent frameworks answer this with a flat list. Here's your tools, go nuts. OpenClaw answers it with a 6-layer pipeline, and the layering is where the design gets smart.

Layer 1: Composition

createOpenClawCodingTools builds the full set. Everything that could possibly be useful starts here.

const tools: AnyAgentTool[] = [
  ...base, // read, write, edit (with sandbox variants)
  execTool,
  processTool,
  ...listChannelAgentTools({ cfg: options?.config }),
  ...createOpenClawTools({
    // browser, canvas, nodes, cron, message, tts,
    // gateway, agents, sessions, subagents, web_search,
    // web_fetch, image, pdf...
  }),
]

Layer 2: Provider-Specific Filtering

Not every model handles every tool the same way:

  • Voice callers don't get tts (feedback loop - the AI would read its own output aloud)
  • xAI/Grok doesn't get web_search (conflicts with its built-in search)
  • apply_patch only works with OpenAI-family models

Layer 3: Authorization

Here's where it gets clever. Non-owners lose access to dangerous tools like cron, gateway, nodes, and whatsapp_login. But the tools aren't removed. They're wrapped with applyOwnerOnlyToolPolicy:

const toolsByAuthorization = applyOwnerOnlyToolPolicy(
  toolsForModelProvider,
  senderIsOwner,
)

The tools still show up in the model's tool list. The model can see them but can't use them - calling one throws an error. This way the model knows the capability exists but is denied, so it can tell the user "I can't do that for you" instead of acting confused about something it's never heard of.

Subtle, but it makes a real difference in conversation quality.
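The "visible but denied" pattern is worth seeing in miniature. This sketch uses an invented `Tool` shape, not OpenClaw's real `AnyAgentTool` type: the tool stays in the list, but its execute function is swapped for one that throws.

```typescript
interface Tool {
  name: string;
  execute: (args: unknown) => Promise<string>;
}

// Illustrative deny-set; the real list is longer (whatsapp_login, etc.).
const OWNER_ONLY = new Set(["cron", "gateway", "nodes"]);

function applyOwnerOnlyPolicy(tools: Tool[], senderIsOwner: boolean): Tool[] {
  if (senderIsOwner) return tools;
  return tools.map((tool) =>
    OWNER_ONLY.has(tool.name)
      ? {
          ...tool,
          // Tool remains in the model's tool list; calling it fails loudly.
          execute: async () => {
            throw new Error(`${tool.name} is restricted to the owner`);
          },
        }
      : tool,
  );
}
```

Because the schema is still advertised, the model can explain the denial instead of hallucinating around a capability it can't see.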

Layer 4: Policy Pipeline

This is the most layered part. Eight policy sources stack up, composed by applyToolPolicyPipeline and buildDefaultToolPolicyPipelineSteps:

const subagentFiltered = applyToolPolicyPipeline({
  tools: toolsByAuthorization,
  steps: [
    ...buildDefaultToolPolicyPipelineSteps({
      profilePolicy,
      providerProfilePolicy,
      globalPolicy,
      globalProviderPolicy,
      agentPolicy,
      agentProviderPolicy,
      groupPolicy,
      agentId,
    }),
    { policy: sandbox?.tools, label: "sandbox tools.allow" },
    { policy: subagentPolicy, label: "subagent tools.allow" },
  ],
})

Profile policies, provider-specific policies, agent policies, group policies, sandbox policies, subagent policies. Each one can allow or deny tools. A nice touch: if an allowlist only references plugin tools, it gets stripped so core tools stay available. You'd use tools.alsoAllow for additive behavior.
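The core reduction is simple even if the policy sources aren't. A hedged sketch, with invented `PolicyStep` fields (`allow`/`deny`) standing in for the real policy shapes: a tool must survive every step to reach the model.

```typescript
interface PolicyStep {
  label: string;
  allow?: string[]; // if present, only these tool names pass this step
  deny?: string[];  // if present, these tool names are removed
}

function applyPolicyPipeline(toolNames: string[], steps: PolicyStep[]): string[] {
  // Fold the tool list through each step; intersection semantics overall.
  return steps.reduce((names, step) => {
    let out = names;
    if (step.allow) out = out.filter((n) => step.allow!.includes(n));
    if (step.deny) out = out.filter((n) => !step.deny!.includes(n));
    return out;
  }, toolNames);
}
```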

Layer 5: Schema Normalization

Different model providers have opinions about JSON schemas. OpenAI rejects root-level unions. Gemini needs constraint keywords stripped. Anthropic expects them. Each tool's parameter schema gets massaged per provider via normalizeToolParameters:

const normalized = subagentFiltered.map((tool) =>
  normalizeToolParameters(tool, {
    modelProvider: options?.modelProvider,
    modelId: options?.modelId,
  }),
)
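What "massaging" means in practice: something like the sketch below, which strips JSON Schema constraint keywords for Gemini-style targets, per the description above. The keyword list and provider check are assumptions for illustration, not the real normalizeToolParameters logic.

```typescript
// Constraint keywords some providers reject; illustrative subset.
const CONSTRAINT_KEYS = ["minLength", "maxLength", "minimum", "maximum", "pattern"];

function normalizeForProvider(
  schema: Record<string, unknown>,
  provider: string,
): Record<string, unknown> {
  if (provider !== "google") return schema; // hypothetical provider id
  const strip = (node: unknown): unknown => {
    if (Array.isArray(node)) return node.map(strip);
    if (node && typeof node === "object") {
      // Recursively drop constraint keywords at every level of the schema.
      return Object.fromEntries(
        Object.entries(node as Record<string, unknown>)
          .filter(([k]) => !CONSTRAINT_KEYS.includes(k))
          .map(([k, v]) => [k, strip(v)]),
      );
    }
    return node;
  };
  return strip(schema) as Record<string, unknown>;
}
```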

Layer 6: Wrapping

Every tool gets wrapped twice. First with a beforeToolCall hook (wrapToolWithBeforeToolCallHook) for event emission and loop detection. Then with an abort signal (wrapToolWithAbortSignal) for cancellation:

const withHooks = normalized.map((tool) =>
  wrapToolWithBeforeToolCallHook(tool, {
    /* ... */
  }),
)
const withAbort = options?.abortSignal
  ? withHooks.map((tool) => wrapToolWithAbortSignal(tool, options.abortSignal))
  : withHooks

Why This Matters

The result: the same agent has different capabilities depending on who's asking (owner vs. guest), what channel they're on (voice vs. text), which model is running (OpenAI vs. Gemini vs. xAI), and whether it's sandboxed (container vs. host).

Same brain. Different permissions. One pipeline.

In a market where most agent frameworks give every user the same flat toolset, this kind of context-aware gating is a real differentiator.

Memory: Hybrid Search With Temporal Decay

If you've built anything with RAG, you know the pitch: embed your docs, do a similarity search, stuff the results into the prompt. Production teams report 20-35% accuracy gains from hybrid approaches over pure vector search. OpenClaw figured this out early.

Memory lives in SQLite + sqlite-vec. Queries run both a vector search (cosine similarity on embeddings) and a keyword search (SQLite FTS5 with BM25 ranking). Results merge with configurable weights.
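The merge step can be sketched as a weighted union of the two result sets. Weights, shapes, and function name here are illustrative, not OpenClaw's exact config: each doc scores `wVec * vectorScore + wKw * keywordScore`, with a missing side counting as zero.

```typescript
function mergeHybrid(
  vector: Map<string, number>,  // docId -> cosine similarity in [0, 1]
  keyword: Map<string, number>, // docId -> normalized BM25 score in [0, 1]
  wVec = 0.7,
  wKw = 0.3,
): Array<[string, number]> {
  // Union of ids from both searches; score each, then rank descending.
  const ids = new Set([...vector.keys(), ...keyword.keys()]);
  return [...ids]
    .map((id): [string, number] => [
      id,
      wVec * (vector.get(id) ?? 0) + wKw * (keyword.get(id) ?? 0),
    ])
    .sort((a, b) => b[1] - a[1]);
}
```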

The keyword search has a quirk worth knowing. SQLite FTS5 returns BM25 scores as negative numbers (lower = more relevant). So there's a score conversion in bm25RankToScore:

export function bm25RankToScore(rank: number): number {
  if (rank < 0) {
    const relevance = -rank
    return relevance / (1 + relevance)
  }
  return 1 / (1 + rank)
}

Flip the sign, normalize to [0, 1]. Four lines that bridge two completely different scoring systems. The kind of thing that takes five minutes to write and saves hours of debugging.

Temporal Decay

Not all memories are equal. What you said yesterday matters more than what you said last month.

OpenClaw applies exponential decay with a 30-day half-life via calculateTemporalDecayMultiplier:

export function calculateTemporalDecayMultiplier(params: {
  ageInDays: number
  halfLifeDays: number
}): number {
  const lambda = Math.LN2 / params.halfLifeDays
  return Math.exp(-lambda * Math.max(0, params.ageInDays))
}

A memory from yesterday keeps 98% of its weight. After 30 days, 50%. After 90 days, 12.5%.

But some things shouldn't fade. MEMORY.md, memory.md, and undated files inside memory/ are exempt from decay. They're pinned knowledge - "the user's name is Karn," "always respond in English." Recent context fades. Identity persists. isEvergreenMemoryPath handles this:

function isEvergreenMemoryPath(filePath: string): boolean {
  const normalized = filePath.replaceAll("\\", "/").replace(/^\.\//, "")
  if (normalized === "MEMORY.md" || normalized === "memory.md") {
    return true
  }
  if (!normalized.startsWith("memory/")) {
    return false
  }
  return !DATED_MEMORY_PATH_RE.test(normalized)
}
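How the pieces combine is easy to sketch, assuming (and this is an assumption, not confirmed by the source) that the decay multiplier simply scales the retrieval score and evergreen paths skip decay entirely:

```typescript
function effectiveScore(baseScore: number, ageInDays: number, evergreen: boolean): number {
  const halfLifeDays = 30; // matches the half-life described in the text
  const decay = evergreen
    ? 1 // pinned knowledge never fades
    : Math.exp(-(Math.LN2 / halfLifeDays) * Math.max(0, ageInDays));
  return baseScore * decay;
}
```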

MMR: Don't Give the AI Ten Copies of the Same Conversation

Ask about "deployment" and the top 10 results might all be from the same chat. Maximal Marginal Relevance (Carbonell & Goldstein, 1998) prevents this.

The algorithm picks results that balance relevance with diversity:

MMR = λ * relevance - (1-λ) * max_similarity_to_already_selected

Lambda is 0.7 - favor relevance, but penalize redundancy. Similarity is measured with Jaccard on tokenized snippets. The AI gets varied context, not ten versions of the same thing.
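The selection loop above can be written in a few lines. This is a generic MMR sketch with Jaccard similarity on token sets and lambda = 0.7 as described; the data shapes are invented for illustration:

```typescript
function jaccard(a: Set<string>, b: Set<string>): number {
  const inter = [...a].filter((t) => b.has(t)).length;
  const union = new Set([...a, ...b]).size;
  return union === 0 ? 0 : inter / union;
}

function mmrSelect(
  candidates: Array<{ id: string; relevance: number; tokens: Set<string> }>,
  k: number,
  lambda = 0.7,
): string[] {
  const selected: typeof candidates = [];
  const pool = [...candidates];
  while (selected.length < k && pool.length > 0) {
    let bestIdx = 0;
    let bestScore = -Infinity;
    for (let i = 0; i < pool.length; i++) {
      // Penalize similarity to anything already picked.
      const maxSim = selected.length
        ? Math.max(...selected.map((s) => jaccard(pool[i].tokens, s.tokens)))
        : 0;
      const score = lambda * pool[i].relevance - (1 - lambda) * maxSim;
      if (score > bestScore) {
        bestScore = score;
        bestIdx = i;
      }
    }
    selected.push(pool.splice(bestIdx, 1)[0]);
  }
  return selected.map((s) => s.id);
}
```

Run it on two near-duplicates and one distinct result, and the duplicate loses despite higher raw relevance - which is exactly the point.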

Query Expansion

Queries get expanded for multilingual users:

  • Korean: strips trailing particles (such as 에서 and 으로) before matching
  • Chinese: generates character n-grams (unigrams + bigrams)
  • Japanese: splits by script (ASCII, katakana, kanji runs)
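The CJK n-gram expansion is the simplest of the three to sketch. A generic version (not OpenClaw's actual tokenizer): emit every character plus every adjacent pair, so FTS5 can match partial words in languages without whitespace.

```typescript
function charNgrams(query: string): string[] {
  // Spread handles code points, not UTF-16 units, so BMP CJK is safe.
  const chars = [...query];
  const unigrams = chars;
  const bigrams = chars.slice(0, -1).map((c, i) => c + chars[i + 1]);
  return [...unigrams, ...bigrams];
}
```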

Graceful Degradation

No embedding provider configured? The system falls back to keyword-only search. No vector similarity, just BM25. It still works - just less rich. This is the kind of resilience that matters when you're running on your own hardware with your own config.

Failover: When the Model Goes Down

Most AI apps show you an error page when the API is down. Your conversation is gone. Start over.

OpenClaw switches models mid-conversation. Two layers:

  • Auth profile rotation - same model, different API key when one hits rate limits or auth failures (see advanceAuthProfile and resolveCooldownDecision)
  • Model fallback - when the whole provider is unreachable, FailoverError triggers runWithModelFallback to try the next candidate from agents.defaults.model.fallbacks

You're on WhatsApp, Claude goes down, and the next message comes from Gemini. No error toast. In a world where LLM outages are weekly news, this is table stakes. Most agents still don't do it.
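A cooldown-aware rotation is a small amount of logic. This is a hypothetical sketch of what key advancement could look like - the real advanceAuthProfile and resolveCooldownDecision are richer than this:

```typescript
interface AuthProfile {
  key: string;
  cooldownUntil: number; // epoch ms; 0 means usable now
}

// Walk the ring of profiles starting after the current one; skip any that
// are still cooling down. -1 means every key is cooling: time to fail over.
function advanceProfile(profiles: AuthProfile[], currentIdx: number, now: number): number {
  for (let step = 1; step <= profiles.length; step++) {
    const idx = (currentIdx + step) % profiles.length;
    if (profiles[idx].cooldownUntil <= now) return idx;
  }
  return -1;
}
```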

Bonus: Rabbit Holes Worth Following

If you like reading codebases for fun, here are some threads to pull:

  • The Proxy that pretends to be a module (src/plugin-sdk/root-alias.cjs): The entire plugin SDK is a JavaScript Proxy that intercepts get, has, ownKeys, and getOwnPropertyDescriptor. It fakes being a real module. 35 channel SDKs, loaded only when you access them. The plugin runtime itself (src/plugins/loader.ts) is also a Proxy. Lazy all the way down.

  • Skills as Markdown (skills/): The agent learns new capabilities by reading SKILL.md files. No code - just prose instructions loaded into the system prompt. Want the agent to use the GitHub CLI? Write a Markdown doc explaining gh. That's it.

  • Context compaction (src/agents/pi-embedded-runner/compact.ts): When conversation history blows past the token budget, old messages get summarized and replaced. The session survives indefinitely - it just gets compressed.

  • Sharp vs sips (src/media/image-ops.ts): On macOS with Bun, Sharp's native bindings don't work. Three lines detect the runtime, check the platform, and swap to Apple's built-in sips CLI. A tiny decision that prevents a whole category of "works on my machine" bugs.

  • 7-tier message routing (src/routing/resolve-route.ts): Every incoming message cascades through peer > parent peer > guild+roles > guild > team > account > channel. First match wins. Route caches use WeakMap<Config, ...> - garbage collected automatically when config changes.
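The Sharp-vs-sips decision from the list above fits in one function. A parameterized sketch (the real check reads the actual runtime and platform rather than taking arguments):

```typescript
function pickImageBackend(runtime: "bun" | "node", platform: string): "sips" | "sharp" {
  // Sharp's native bindings break under Bun on macOS, so fall back to
  // Apple's built-in sips CLI there; everywhere else, use Sharp.
  return runtime === "bun" && platform === "darwin" ? "sips" : "sharp";
}
```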

What I Took Away

Composition beats monoliths. Tools, policies, context, and failover are all composed from small, independent pieces. No god function orchestrates everything - each layer does one job and hands off.

Always have a fallback. No embeddings? Keyword search. No API key? Next key. No provider? Next provider. The system always has somewhere to go. The user never sees an error page if there's any path forward. In a market where agents fail 25-40% of the time across consecutive runs, this is what separates a toy from a tool.

Authorization is a pipeline, not a gate. Eight layers of tool filtering means the same AI behaves differently for different users, channels, models, and sandboxes. It's not "can you use tools: yes/no" - it's "which tools, for whom, in what context."

Memory needs more than embeddings. Hybrid search, temporal decay, diversity re-ranking, multilingual expansion, evergreen exemptions. Vector similarity alone gets you 60-70% accuracy. The real work - the stuff that gets you to 90%+ - is everything built around it.

Where This Is All Going

We're in the "everyone has a demo" phase of AI agents. The next 18 months will sort out who has infrastructure.

The patterns in OpenClaw's codebase - retry loops, model failover, permission pipelines, hybrid memory - aren't unique to OpenClaw. They're the same patterns showing up in every serious agent deployment right now. Sierra built them for enterprise. TrueFoundry productized them. OpenClaw shipped them open-source. The convergence is telling.

What's coming next, based on where the hard problems still are:

Multi-agent coordination will get messy before it gets clean. Spawning subagents is easy. Knowing when to spawn them, how to scope their permissions, and how to merge their outputs without hallucinating a consensus - that's still unsolved. OpenClaw has subagent spawning today; the hard part is the orchestration layer on top.

The memory problem is still open. Hybrid search + decay + MMR is genuinely good. But "what should the agent remember forever vs. let fade" is still a human judgment call baked into config files. The next leap is agents that manage their own memory hygiene - deciding what's worth keeping without being told.

Local-first will matter more, not less. As agents get access to more sensitive tools (shell, files, messaging, finance), running them on your own hardware stops being a hobbyist preference and starts being a compliance requirement. OpenClaw's local gateway model looks prescient from here.

The channel layer is underrated. Right now most agent UIs are chat boxes. OpenClaw's 36-channel model - where the same agent lives in WhatsApp, Slack, iMessage, and Discord simultaneously - hints at what agents look like when they stop being apps and start being ambient. You don't open the agent. It's just... there, wherever you already are.

The demos are impressive. The infrastructure is what will actually matter. The teams building the retry loops and the permission pipelines and the memory systems today are the ones whose agents will still be running in two years.


What's the most interesting AI agent architecture you've come across? I'd love to hear about it.


© 2026 karnstack · by karn
