Self-Healing Supervisor
Erlang-inspired restart strategies, budget tracking, exponential backoff, and process supervision for fault-tolerant agent management.
The self-healing supervisor provides Erlang-inspired fault tolerance for WeftOS agents. When an agent crashes, the supervisor applies a configurable restart strategy, respects a budget to prevent restart storms, and uses exponential backoff to avoid thundering herds.
Feature: os-patterns
Source: crates/clawft-kernel/src/supervisor.rs
Phase: K1
Restart Strategies
Five strategies determine what happens when an agent fails:
pub enum RestartStrategy {
    OneForOne,  // Restart only the failed child (default)
    OneForAll,  // Restart all children if one fails
    RestForOne, // Restart the failed child and all children started after it
    Permanent,  // Do not restart; let the agent stay dead
    Transient,  // Restart only if the exit code is non-zero
}
Strategy Selection Guide
| Strategy | Use Case |
|---|---|
| OneForOne | Independent agents that don't share state |
| OneForAll | Tightly coupled agents where partial state is invalid |
| RestForOne | Pipeline stages where downstream depends on upstream |
| Permanent | One-shot tasks (batch jobs, migrations) |
| Transient | Agents that may exit cleanly (0) but should restart on crash |
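To make the difference between the strategies concrete, here is a minimal sketch of how the set of children to restart could be computed. The helper `children_to_restart` is hypothetical and not taken from supervisor.rs; it assumes children are tracked by index in start order, and that Transient restarts only the failed child once the exit-code check has passed.

```rust
/// Local copy of the enum from this page so the sketch compiles on its own.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RestartStrategy {
    OneForOne,
    OneForAll,
    RestForOne,
    Permanent,
    Transient,
}

/// Hypothetical helper: returns the indices of the children to restart when
/// the child at index `failed` exits. Children are assumed to be stored in
/// the order they were started.
pub fn children_to_restart(
    strategy: RestartStrategy,
    failed: usize,
    child_count: usize,
) -> Vec<usize> {
    if failed >= child_count {
        return Vec::new(); // Unknown child: nothing to restart.
    }
    match strategy {
        // Only the failed child. For Transient this assumes the exit-code
        // check has already been made elsewhere.
        RestartStrategy::OneForOne | RestartStrategy::Transient => vec![failed],
        // Every child, including the ones that are still healthy.
        RestartStrategy::OneForAll => (0..child_count).collect(),
        // The failed child plus everything started after it.
        RestartStrategy::RestForOne => (failed..child_count).collect(),
        // Never restarted.
        RestartStrategy::Permanent => Vec::new(),
    }
}
```

With four children and a failure at index 1, RestForOne selects children 1, 2, and 3, which matches the pipeline use case in the table: upstream stage 0 keeps running while the failed stage and its downstream dependents restart.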
Restart Budget
Prevents infinite restart loops by capping restarts within a time window:
pub struct RestartBudget {
    pub max_restarts: u32, // Default: 5
    pub within_secs: u64,  // Default: 60
}
When the budget is exhausted, the supervisor escalates: it stops itself and notifies its parent supervisor. The window resets automatically once within_secs seconds have elapsed.
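The windowing rule can be sketched as follows. This is an illustration under assumptions, not the code in supervisor.rs: `BudgetWindow` and `try_record` are hypothetical names, the current time is passed in explicitly so the logic is easy to test, and the sketch assumes that exactly max_restarts restarts are allowed per window before escalation.

```rust
use std::time::{Duration, Instant};

pub struct RestartBudget {
    pub max_restarts: u32,
    pub within_secs: u64,
}

impl Default for RestartBudget {
    // Defaults as documented above: 5 restarts within 60 seconds.
    fn default() -> Self {
        Self { max_restarts: 5, within_secs: 60 }
    }
}

/// Hypothetical per-agent window state.
pub struct BudgetWindow {
    pub window_start: Instant,
    pub restart_count: u32,
}

impl BudgetWindow {
    pub fn new(now: Instant) -> Self {
        Self { window_start: now, restart_count: 0 }
    }

    /// Returns true if a restart at `now` fits the budget, or false if the
    /// caller should escalate instead.
    pub fn try_record(&mut self, budget: &RestartBudget, now: Instant) -> bool {
        let window = Duration::from_secs(budget.within_secs);
        // Once the window has elapsed, start a fresh one with a zero count.
        if now.duration_since(self.window_start) >= window {
            self.window_start = now;
            self.restart_count = 0;
        }
        if self.restart_count >= budget.max_restarts {
            return false; // Budget exhausted within the current window.
        }
        self.restart_count += 1;
        true
    }
}
```

Passing `now` as a parameter rather than calling `Instant::now()` inside the method is a design choice for this sketch only; it lets a test simulate the window expiring without sleeping.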
Exponential Backoff
Each restart increases the delay exponentially:
100ms → 200ms → 400ms → 800ms → ... → 30s (cap)
Formula: 100ms * 2^(restart_count - 1), capped at 30 seconds. This prevents a crashing agent from consuming resources in a tight loop.
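The formula translates into a few lines of Rust. The sketch below is an illustration, not the supervisor's own code: the function name `backoff_ms` and the constant names are hypothetical, and returning 0 for a restart count of 0 is an assumption, since the formula is only defined from the first restart onward.

```rust
const BASE_DELAY_MS: u64 = 100;
const MAX_DELAY_MS: u64 = 30_000;

/// Delay before the `restart_count`-th restart (1-based), in milliseconds:
/// 100ms * 2^(restart_count - 1), capped at 30 seconds.
pub fn backoff_ms(restart_count: u32) -> u64 {
    if restart_count == 0 {
        return 0; // Assumption: no delay before any restart has happened.
    }
    // 2^(restart_count - 1); checked_shl returns None once the shift
    // reaches 64 bits, which is treated as "very large" here.
    let factor = 1u64.checked_shl(restart_count - 1).unwrap_or(u64::MAX);
    // saturating_mul avoids overflow before the cap is applied.
    BASE_DELAY_MS.saturating_mul(factor).min(MAX_DELAY_MS)
}
```

The cap is first reached on the tenth restart: the ninth delay is 100ms * 2^8 = 25,600ms, and the tenth would be 51,200ms, which is clamped to 30,000ms.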
Restart Tracker
The RestartTracker maintains per-agent restart state:
pub struct RestartTracker {
    pub restart_count: u32,
    pub window_start: Instant,
    pub last_restart: Option<Instant>,
    pub backoff_ms: u64,
}
Key methods:
| Method | Description |
|---|---|
| record_restart(budget) | Record a restart; returns false if the budget is exceeded |
| is_exhausted(budget) | Check whether the budget is used up |
| remaining(budget) | Restarts remaining in the current window |
| next_backoff_ms() | Calculate the next backoff delay |
| should_restart(strategy, exit_code) | Policy check for the restart decision |
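As an illustration of the policy check, should_restart could look roughly like the sketch below. It is written as a free function for brevity; the `i32` exit-code type and the rule that OneForOne, OneForAll, and RestForOne restart regardless of exit code are assumptions, not confirmed by the source.

```rust
/// Local copy of the enum from this page so the sketch compiles on its own.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RestartStrategy {
    OneForOne,
    OneForAll,
    RestForOne,
    Permanent,
    Transient,
}

/// Sketch of the policy check: decides whether an exited agent should be
/// restarted at all, before any budget or backoff is considered.
pub fn should_restart(strategy: RestartStrategy, exit_code: i32) -> bool {
    match strategy {
        // Never restarted, whatever the exit code.
        RestartStrategy::Permanent => false,
        // A clean exit (0) is final; any other code counts as a crash.
        RestartStrategy::Transient => exit_code != 0,
        // Assumption: these strategies restart on every exit.
        RestartStrategy::OneForOne
        | RestartStrategy::OneForAll
        | RestartStrategy::RestForOne => true,
    }
}
```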
Supervisor Integration
The AgentSupervisor integrates restart strategies into agent lifecycle:
supervisor.spawn_supervised(
    agent_config,
    RestartStrategy::OneForOne,
    RestartBudget { max_restarts: 3, within_secs: 30 },
);
When an agent exits:
1. should_restart() checks the strategy and exit code.
2. If yes, record_restart() checks the budget.
3. If within budget, the agent sleeps for backoff_ms, then restarts.
4. If the budget is exceeded, the supervisor escalates to its parent.
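The exit-handling steps above can be sketched as a single decision function. Everything here is a simplified illustration under assumptions: `ExitAction` and `on_agent_exit` are hypothetical names, the restart count is passed in directly instead of living in a RestartTracker, and the time-window reset is left out to keep the control flow visible.

```rust
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RestartStrategy {
    OneForOne,
    OneForAll,
    RestForOne,
    Permanent,
    Transient,
}

pub struct RestartBudget {
    pub max_restarts: u32,
    pub within_secs: u64, // Unused here: the window reset is omitted.
}

/// Hypothetical outcome of handling one agent exit.
#[derive(Debug, PartialEq, Eq)]
pub enum ExitAction {
    StayDown,
    RestartAfter(Duration),
    Escalate,
}

pub fn on_agent_exit(
    strategy: RestartStrategy,
    exit_code: i32,
    restart_count: &mut u32,
    budget: &RestartBudget,
) -> ExitAction {
    // Step 1: does the strategy want a restart for this exit code?
    let wants_restart = match strategy {
        RestartStrategy::Permanent => false,
        RestartStrategy::Transient => exit_code != 0,
        _ => true, // Assumption: the other strategies always restart.
    };
    if !wants_restart {
        return ExitAction::StayDown;
    }
    // Steps 2 and 4: check the budget; escalate when it is used up.
    if *restart_count >= budget.max_restarts {
        return ExitAction::Escalate;
    }
    *restart_count += 1;
    // Step 3: 100ms * 2^(restart_count - 1), capped at 30 seconds.
    let factor = 1u64.checked_shl(*restart_count - 1).unwrap_or(u64::MAX);
    let delay_ms = 100u64.saturating_mul(factor).min(30_000);
    ExitAction::RestartAfter(Duration::from_millis(delay_ms))
}
```

Returning an action value instead of sleeping inside the function keeps the decision logic free of timing and I/O, so the caller (for example an async supervisor loop) can decide how to wait.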
Configuration
In weave.toml:
[supervisor]
default_strategy = "OneForOne"
max_restarts = 5
restart_window_secs = 60
Testing
15+ supervisor tests cover all strategies, budget exhaustion, backoff calculation, window reset, and escalation paths.