Self-Healing Supervisor

Erlang-inspired restart strategies, budget tracking, exponential backoff, and process supervision for fault-tolerant agent management.

The self-healing supervisor provides Erlang-inspired fault tolerance for WeftOS agents. When an agent crashes, the supervisor applies a configurable restart strategy, respects a budget to prevent restart storms, and uses exponential backoff to avoid thundering herds.

Feature: os-patterns Source: crates/clawft-kernel/src/supervisor.rs Phase: K1

Restart Strategies

Five strategies determine what happens when an agent fails:

pub enum RestartStrategy {
    OneForOne,   // Restart only the failed child (default)
    OneForAll,   // Restart all children if one fails
    RestForOne,  // Restart failed child + all started after it
    Permanent,   // Do not restart — let the agent stay dead
    Transient,   // Restart only if exit code is non-zero
}

Strategy Selection Guide

Strategy	Use Case
`OneForOne`	Independent agents that don't share state
`OneForAll`	Tightly coupled agents where partial state is invalid
`RestForOne`	Pipeline stages where downstream depends on upstream
`Permanent`	One-shot tasks (batch jobs, migrations)
`Transient`	Agents that may exit cleanly (0) but should restart on crash

Restart Budget

Prevents infinite restart loops by capping restarts within a time window:

pub struct RestartBudget {
    pub max_restarts: u32,  // Default: 5
    pub within_secs: u64,   // Default: 60
}

When the budget is exhausted, the supervisor escalates — stops itself and notifies its parent supervisor. The window auto-resets after within_secs elapses.

Exponential Backoff

Each restart increases the delay exponentially:

100ms → 200ms → 400ms → 800ms → ... → 30s (cap)

Formula: 100ms * 2^(restart_count - 1), capped at 30 seconds. This prevents a crashing agent from consuming resources in a tight loop.

Restart Tracker

The RestartTracker maintains per-agent restart state:

pub struct RestartTracker {
    pub restart_count: u32,
    pub window_start: Instant,
    pub last_restart: Option<Instant>,
    pub backoff_ms: u64,
}

Key methods:

Method	Description
`record_restart(budget)`	Record a restart, returns `false` if budget exceeded
`is_exhausted(budget)`	Check if budget is used up
`remaining(budget)`	Restarts remaining in current window
`next_backoff_ms()`	Calculate next backoff delay
`should_restart(strategy, exit_code)`	Policy check for restart decision

Supervisor Integration

The AgentSupervisor integrates restart strategies into agent lifecycle:

supervisor.spawn_supervised(
    agent_config,
    RestartStrategy::OneForOne,
    RestartBudget { max_restarts: 3, within_secs: 30 },
);

When an agent exits:

should_restart() checks the strategy and exit code
If yes, record_restart() checks the budget
If within budget, the agent sleeps for backoff_ms then restarts
If budget exceeded, the supervisor escalates to its parent

Configuration

In weave.toml:

[supervisor]
default_strategy = "OneForOne"
max_restarts = 5
restart_window_secs = 60

Testing

15+ supervisor tests cover all strategies, budget exhaustion, backoff calculation, window reset, and escalation paths.

Self-Healing Supervisor

On this page