clawft

Self-Healing Supervisor

Erlang-inspired restart strategies, budget tracking, exponential backoff, and process supervision for fault-tolerant agent management.

The self-healing supervisor provides Erlang-inspired fault tolerance for WeftOS agents. When an agent crashes, the supervisor applies a configurable restart strategy, respects a budget to prevent restart storms, and uses exponential backoff to avoid thundering herds.

Feature: os-patterns Source: crates/clawft-kernel/src/supervisor.rs Phase: K1

Restart Strategies

Five strategies determine what happens when an agent fails:

pub enum RestartStrategy {
    OneForOne,   // Restart only the failed child (default)
    OneForAll,   // Restart all children if one fails
    RestForOne,  // Restart failed child + all started after it
    Permanent,   // Do not restart — let the agent stay dead
    Transient,   // Restart only if exit code is non-zero
}

Strategy Selection Guide

StrategyUse Case
OneForOneIndependent agents that don't share state
OneForAllTightly coupled agents where partial state is invalid
RestForOnePipeline stages where downstream depends on upstream
PermanentOne-shot tasks (batch jobs, migrations)
TransientAgents that may exit cleanly (0) but should restart on crash

Restart Budget

Prevents infinite restart loops by capping restarts within a time window:

pub struct RestartBudget {
    pub max_restarts: u32,  // Default: 5
    pub within_secs: u64,   // Default: 60
}

When the budget is exhausted, the supervisor escalates — stops itself and notifies its parent supervisor. The window auto-resets after within_secs elapses.

Exponential Backoff

Each restart increases the delay exponentially:

100ms → 200ms → 400ms → 800ms → ... → 30s (cap)

Formula: 100ms * 2^(restart_count - 1), capped at 30 seconds. This prevents a crashing agent from consuming resources in a tight loop.

Restart Tracker

The RestartTracker maintains per-agent restart state:

pub struct RestartTracker {
    pub restart_count: u32,
    pub window_start: Instant,
    pub last_restart: Option<Instant>,
    pub backoff_ms: u64,
}

Key methods:

MethodDescription
record_restart(budget)Record a restart, returns false if budget exceeded
is_exhausted(budget)Check if budget is used up
remaining(budget)Restarts remaining in current window
next_backoff_ms()Calculate next backoff delay
should_restart(strategy, exit_code)Policy check for restart decision

Supervisor Integration

The AgentSupervisor integrates restart strategies into agent lifecycle:

supervisor.spawn_supervised(
    agent_config,
    RestartStrategy::OneForOne,
    RestartBudget { max_restarts: 3, within_secs: 30 },
);

When an agent exits:

  1. should_restart() checks the strategy and exit code
  2. If yes, record_restart() checks the budget
  3. If within budget, the agent sleeps for backoff_ms then restarts
  4. If budget exceeded, the supervisor escalates to its parent

Configuration

In weave.toml:

[supervisor]
default_strategy = "OneForOne"
max_restarts = 5
restart_window_secs = 60

Testing

15+ supervisor tests cover all strategies, budget exhaustion, backoff calculation, window reset, and escalation paths.

On this page