WeftOS Toy EML-Transformer — live browser trainer

Trains a tiny attention block in your browser and exports Rust-loadable JSON. Pure TypeScript port of the Rust ToyEmlAttention at crates/eml-core/src/attention.rs. No Python, no Pyodide, no Colab — pick a mode, click Train, watch it learn, download the weights.

Forward-pass-only sibling at /clawft_eml-attention. Architecture spec at /docs/weftos/eml-attention.

New here? What is this thing actually doing?

This page trains an attention block — the same building block used inside every modern transformer LLM (GPT, Claude, etc.) — but at toy scale (a few hundred parameters instead of billions).

What attention does: given a sequence of tokens (here represented as random vectors), each output position is a weighted blend of every input position. The weights — the attention pattern — are computed dynamically per input. That dynamism is what lets transformers handle variable-length, context-dependent inputs.

The four sub-models: attention computes three projections of the input (Q for Query, K for Key, V for Value), uses Q and K to compute weights via softmax(Q·Kᵀ / √dₖ), blends V using those weights, then projects the result through a final output layer. The "Iter 0" variant trains only the output layer; "Iter 1" trains all four (Q/K/V/out) jointly.

What training does: we feed in random input vectors paired with target outputs (here: the mean of each sequence position's vector, broadcast back across d_model), measure how wrong the current weights are (Mean Squared Error), then nudge the weights to reduce that error. This page uses gradient-free coordinate descent: pick a random parameter, perturb it, keep the change if MSE drops, otherwise revert. No backprop, no autograd, just trial and error guided by error feedback. That is slower than gradient descent but simpler, reproducible under a fixed seed, and friendly to small substrates.
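To make the target concrete, here is a minimal sketch of the "per-position mean broadcast" construction. The function name and the flattened row-major layout are assumptions made for illustration; the actual trainer may organize its buffers differently.

```rust
/// Build the "per-position mean broadcast" target for one sample.
/// `input` is assumed to be a flattened, row-major seq_len x d_model buffer
/// (an assumption of this sketch, not a confirmed layout).
fn mean_broadcast_target(input: &[f64], seq_len: usize, d_model: usize) -> Vec<f64> {
    assert_eq!(input.len(), seq_len * d_model, "input must hold seq_len * d_model values");
    let mut target = Vec::with_capacity(input.len());
    for pos in 0..seq_len {
        // Slice out the d_model features belonging to this position.
        let row = &input[pos * d_model..(pos + 1) * d_model];
        let mean = row.iter().sum::<f64>() / d_model as f64;
        // Broadcast the mean back across every feature of this position.
        for _ in 0..d_model {
            target.push(mean);
        }
    }
    target
}
```

Every position's target row is constant, which is why the text calls the target low-rank.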

Two substrates compared in "EML vs baseline" mode:

  • EML (SafeTree): each Q/K/V/out projection is a depth-N tree of exp(x) − ln(y) operators. Universal function approximator, weights snap to integers, human-readable trained models. WeftOS's bet for interpretable learned functions.
  • Baseline (affine): each projection is a plain W·x + b matrix multiply. Standard transformer math. Fewer parameters, faster, easier to train.

Same training loop, same trial budget, same seed — only the substrate differs. Watch them converge side-by-side.

Configuration

These knobs change the shape of the attention block and the training budget. d_model is the input embedding dimension per token; seq_len is the number of tokens per sample; d_k is the Q/K/V projection width; depth is the EML tree depth (ignored by the affine baseline). rounds and samples control how long training runs. seed makes runs reproducible — same seed, same trajectory.
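As a rough illustration of how these knobs fit together, the sketch below mirrors them in a struct and draws seeded inputs. The struct name, the helper names, and the choice of SplitMix64 as the generator are all assumptions; the page does not say which PRNG it uses. The point is only that a fixed seed yields identical data on every run.

```rust
/// Hypothetical mirror of the page's knobs; field names follow the text.
#[derive(Debug, Clone, PartialEq)]
struct ToyConfig {
    d_model: usize,
    seq_len: usize,
    d_k: usize,
    depth: usize, // EML tree depth; ignored by the affine baseline
    rounds: usize,
    samples: usize,
    seed: u64,
}

/// Small deterministic generator (SplitMix64), used here as a stand-in.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform f64 in [-1, 1): top 53 bits mapped to [0, 1), then rescaled.
    fn next_unit(&mut self) -> f64 {
        let u = (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64;
        2.0 * u - 1.0
    }
}

/// Draw `samples` flattened inputs, each seq_len * d_model values long.
fn make_inputs(cfg: &ToyConfig) -> Vec<Vec<f64>> {
    let mut rng = SplitMix64(cfg.seed);
    let width = cfg.seq_len * cfg.d_model;
    let mut inputs = Vec::with_capacity(cfg.samples);
    for _ in 0..cfg.samples {
        let mut sample = Vec::with_capacity(width);
        for _ in 0..width {
            sample.push(rng.next_unit());
        }
        inputs.push(sample);
    }
    inputs
}
```

Calling make_inputs twice with the same config returns identical vectors; changing only the seed changes the data.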

Parameter count at this shape: 2848. The trainer only optimizes out_model per the Iteration-0 spec; Q/K/V stay at their deterministic seeded init.

Train

Pick a mode, then click Train. EML vs baseline runs both substrates back-to-back and shows a side-by-side comparison — this is the most informative mode and the default. The single-mode options run only one substrate so you can study its behavior in isolation.

Target: per-position mean broadcast (low-rank, learnable under both modes).

Export Rust-loadable JSON

Each download contains the full weights for one submodel in a shape that round-trips with eml_core::EmlModel. You can train offline in the browser, commit the JSON to your repo, and load it directly with EmlModel::from_json.

Loading in Rust

use eml_core::EmlModel;

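// Read the JSON exported by this page for the Q sub-model.
let q_json = std::fs::read_to_string("eml_attention_toy_q.json")?;
// Parse it; unwrap() panics if the JSON is not a valid EmlModel.
let q = EmlModel::from_json(&q_json).unwrap();
// Repeat for k, v, and out, then plug all four into ToyEmlAttention.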

What's happening under the hood

Forward pass — the attention math: every time the trainer evaluates MSE, the block runs this sequence:

  1. Project to Q, K, V. The flattened input (seq_len × d_model) goes into three separate projection layers. Each produces a seq_len × d_k tensor. Q is what each position is "asking about", K is what each position "advertises", V is what each position will contribute if attended to.
  2. Score: for every pair of positions (i, j), compute scoresᵢⱼ = (Qᵢ · Kⱼ) / √dₖ. Bigger score means position i wants more of position j's value.
  3. Softmax: turn each row of scores into a probability distribution (rows sum to 1). This is the attention pattern — the dynamic, input-dependent routing.
  4. Blend V: compute contextᵢ = Σⱼ attnᵢⱼ · Vⱼ. Each position now contains a weighted blend of every position's value.
  5. Output projection: project context back to d_model dimensions per position. This is the block's output.
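
The five steps above can be sketched with the affine baseline as follows. This is not the ToyEmlAttention code: the Affine type and function names are hypothetical, and for brevity the sketch applies each projection per position (the standard transformer form), whereas the page describes projecting the flattened input, so the real layer shapes may differ.

```rust
/// Per-position affine map y = W x + b. `w` is row-major, out_dim x in_dim.
struct Affine {
    w: Vec<f64>,
    b: Vec<f64>,
    in_dim: usize,
    out_dim: usize,
}

impl Affine {
    fn apply(&self, x: &[f64]) -> Vec<f64> {
        (0..self.out_dim)
            .map(|o| {
                let row = &self.w[o * self.in_dim..(o + 1) * self.in_dim];
                self.b[o] + row.iter().zip(x).map(|(w, xi)| w * xi).sum::<f64>()
            })
            .collect()
    }
}

/// Numerically stable softmax over one row of scores.
fn softmax(row: &[f64]) -> Vec<f64> {
    let max = row.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = row.iter().map(|s| (s - max).exp()).collect();
    let total: f64 = exps.iter().sum();
    exps.iter().map(|e| e / total).collect()
}

/// One attention block: `tokens` holds seq_len vectors of length d_model.
fn attention(
    tokens: &[Vec<f64>],
    q: &Affine,
    k: &Affine,
    v: &Affine,
    out: &Affine,
) -> Vec<Vec<f64>> {
    // Step 1: project every token to Q, K, and V.
    let qs: Vec<Vec<f64>> = tokens.iter().map(|t| q.apply(t)).collect();
    let ks: Vec<Vec<f64>> = tokens.iter().map(|t| k.apply(t)).collect();
    let vs: Vec<Vec<f64>> = tokens.iter().map(|t| v.apply(t)).collect();
    let scale = (q.out_dim as f64).sqrt(); // sqrt(d_k)

    qs.iter()
        .map(|qi| {
            // Steps 2-3: scaled dot-product scores for row i, then softmax.
            let scores: Vec<f64> = ks
                .iter()
                .map(|kj| qi.iter().zip(kj).map(|(a, b)| a * b).sum::<f64>() / scale)
                .collect();
            let attn = softmax(&scores);
            // Step 4: weighted blend of the value vectors.
            let mut context = vec![0.0; v.out_dim];
            for (a, vj) in attn.iter().zip(&vs) {
                for (c, x) in context.iter_mut().zip(vj) {
                    *c += a * x;
                }
            }
            // Step 5: project the context back to d_model.
            out.apply(&context)
        })
        .collect()
}
```

With all-zero Q and K weights every score is 0, the softmax is uniform, and each output row becomes the plain average of the value vectors, which is a handy sanity check.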

Real transformers stack many of these blocks (12 in BERT-base, 96 in GPT-3) with residual connections, layer norm, and a feed-forward network between attention layers. We're showing one block in isolation.

Training — what the optimizer is actually doing:

  1. Pick a random parameter index across the union of all layers.
  2. Generate a perturbation δ from an annealing schedule (big early, small late).
  3. Apply δ, run the forward pass on a 16-sample subset, measure MSE.
  4. If MSE is lower than the previous best, keep the change. Else revert.
  5. Repeat for the full trial budget. The fraction of accepted perturbations falls as the model converges: early on there are easy wins; later, progress needs precise small steps.
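
The loop above can be sketched generically over any parameter vector and loss function. Everything named here is an assumption for illustration: the function signature, the SplitMix64 stand-in generator, and the linear step-size decay from 1.0 to 0.01 (the page only says "big early, small late"). In the real trainer the loss closure would run the forward pass on the 16-sample subset.

```rust
/// Small deterministic generator (SplitMix64), used here as a stand-in.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform f64 in [-1, 1).
    fn next_unit(&mut self) -> f64 {
        2.0 * ((self.next_u64() >> 11) as f64 / (1u64 << 53) as f64) - 1.0
    }
}

/// Greedy random coordinate descent with an annealed step size.
/// Returns the best loss found and the number of accepted trials.
fn coordinate_descent<F>(params: &mut [f64], trials: usize, seed: u64, loss: F) -> (f64, usize)
where
    F: Fn(&[f64]) -> f64,
{
    let mut best = loss(&*params);
    if params.is_empty() || trials == 0 {
        return (best, 0);
    }
    let mut rng = SplitMix64(seed);
    let mut accepted = 0;
    for t in 0..trials {
        // Step 1: pick a random parameter index.
        let idx = (rng.next_u64() % params.len() as u64) as usize;
        // Step 2: step size shrinks linearly from 1.0 to 0.01 (hypothetical schedule).
        let progress = t as f64 / trials as f64;
        let step = 1.0 - 0.99 * progress;
        let delta = step * rng.next_unit();
        // Step 3: apply the perturbation and re-evaluate the loss.
        let old = params[idx];
        params[idx] = old + delta;
        let candidate = loss(&*params);
        // Step 4: keep strict improvements, otherwise revert.
        if candidate < best {
            best = candidate;
            accepted += 1;
        } else {
            params[idx] = old;
        }
    }
    (best, accepted)
}
```

Because the generator is seeded and the acceptance rule is a plain comparison, two runs with the same seed, trial budget, and loss follow the same trajectory.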

This is random coordinate descent with an annealed step size. Because step 4 only ever accepts improvements, it is a greedy search rather than simulated annealing in the strict sense, which occasionally accepts worse moves. It is about the simplest gradient-free optimizer that still works on hundreds of parameters. Real LLMs use gradient-based optimizers such as Adam, which are orders of magnitude more sample-efficient but require backprop, which EML deliberately avoids because the exp(x) − ln(y) tree is hard to differentiate cleanly.

What MSE means here: Mean Squared Error is the average of (prediction − target)² across all output dimensions and all samples. MSE = 0 means perfect prediction. MSE = 1.0 means a root-mean-square error of 1 unit per dimension, which is large on this task because inputs (and therefore the mean-broadcast targets) lie in [-1, 1].
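
A minimal sketch of that definition, assuming predictions and targets are stored as one vector per sample (the function name and layout are illustrative, not taken from the trainer):

```rust
/// Mean squared error over all samples and all output dimensions.
fn mse(predictions: &[Vec<f64>], targets: &[Vec<f64>]) -> f64 {
    assert_eq!(predictions.len(), targets.len(), "one target per prediction");
    let mut sum = 0.0;
    let mut count = 0usize;
    for (p, t) in predictions.iter().zip(targets) {
        assert_eq!(p.len(), t.len(), "prediction and target widths must match");
        for (a, b) in p.iter().zip(t) {
            sum += (a - b) * (a - b);
            count += 1;
        }
    }
    // An empty batch has no error to report; this sketch returns 0.0 for it.
    if count == 0 {
        0.0
    } else {
        sum / count as f64
    }
}
```

For example, predictions that are off by exactly 1 in every dimension give an MSE of exactly 1.0.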

Why two substrates?

EML (Exp-Minus-Log) replaces every multiply with a tree of eml(x, y) = exp(x) − ln(y) operations. Combined with the constant 1, this single operator can reconstruct any elementary mathematical function (Odrzywolel 2026, arXiv:2603.21852). Trained EML weights snap to exact integers, making the model human-readable as a closed-form formula. WeftOS has 12 production EML models doing scalar control-plane decisions.
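
To get a feel for how one operator plus the constant 1 can rebuild familiar functions, the sketch below recovers exp and ln from eml alone. The two identities follow directly from the definition eml(x, y) = exp(x) − ln(y); whether the cited paper or the SafeTree implementation uses these exact constructions is not stated here, and the function names are hypothetical.

```rust
/// The EML operator as defined in the text: eml(x, y) = exp(x) - ln(y).
fn eml(x: f64, y: f64) -> f64 {
    x.exp() - y.ln()
}

/// exp recovered with the constant 1, since ln(1) = 0.
fn exp_via_eml(x: f64) -> f64 {
    eml(x, 1.0)
}

/// ln recovered with three nested eml calls (valid for x > 0):
///   eml(1, x)         = e - ln x
///   eml(e - ln x, 1)  = e^e / x
///   eml(1, e^e / x)   = e - (e - ln x) = ln x
fn ln_via_eml(x: f64) -> f64 {
    eml(1.0, eml(eml(1.0, x), 1.0))
}
```

Both helpers use only eml and the constant 1, so each is itself a small EML tree; the results agree with the built-in exp and ln up to floating-point rounding.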

Baseline is just W·x + b — the standard transformer projection. No tree, no transcendental operators, just dot products. This is what every published transformer uses.

The honest comparison at toy scale (see results above): plain affine attention converges much faster, runs ~3× faster on the forward pass, and uses ~2× fewer parameters than EML. What EML buys at this scale is interpretability: the trained model is a closed-form expression, not an opaque weight tensor. Whether that tradeoff is worth it depends on what you're building. For sequence-modeling backbones, baseline wins. For control-plane scalar decisions (the existing 12 WeftOS wrappers), EML wins decisively.

This is research, not production. The real finding from this experiment isn't "use EML attention" — it's "we now know EML's sweet spot ends around vector-output sequence work, and starts paying off again as scalar interpretable functions". See the EML Attention docs for the full reasoning and the upcoming "EML at the control-plane boundary" design pattern.