WeftOS Toy EML-Transformer — live browser trainer
Trains a tiny attention block in your browser and exports Rust-loadable JSON. Pure TypeScript port of the Rust ToyEmlAttention at crates/eml-core/src/attention.rs. No Python, no Pyodide, no Colab — pick a mode, click Train, watch it learn, download the weights.
Forward-pass-only sibling at /clawft_eml-attention. Architecture spec at /docs/weftos/eml-attention.
New here? What is this thing actually doing?
This page trains an attention block — the same building block used inside every modern transformer LLM (GPT, Claude, etc.) — but at toy scale (a few hundred parameters instead of billions).
What attention does: given a sequence of tokens (here represented as random vectors), each output position is a weighted blend of every input position. The weights — the attention pattern — are computed dynamically per input. That dynamism is what lets transformers handle variable-length, context-dependent inputs.
The five sub-models: attention computes three projections of the input (Q for Query, K for Key, V for Value), then uses Q and K to decide weights via softmax(Q·Kᵀ / √dₖ), then blends V using those weights, then projects through a final output layer. The "Iter 0" variant only trains the output layer; "Iter 1" trains all four (Q/K/V/out) jointly.
What training does: we feed in random input vectors paired with target outputs (here: per-sequence-position mean broadcast back across d_model), measure how wrong the current weights are (Mean Squared Error), then nudge weights to reduce that error. This page uses gradient-free coordinate descent: pick a random parameter, perturb it, keep the change if MSE drops, otherwise revert. No backprop, no autograd — just trial-and-error guided by error feedback. That's slower than gradient descent but simpler, deterministic, and small-substrate-friendly.
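As a rough illustration of the target described above, the sketch below builds the "per-position mean, broadcast across d_model" target for one flattened sample. The helper name `makeTarget` and the row-major layout are assumptions made for this example, not names taken from the trainer's source.

```typescript
// Hypothetical helper (not from the trainer's source): build the regression
// target for one sample. `input` is assumed to be a flattened, row-major
// seq_len × d_model array, i.e. one row of d_model values per position.
function makeTarget(input: number[], seqLen: number, dModel: number): number[] {
  const target = new Array<number>(seqLen * dModel).fill(0);
  for (let pos = 0; pos < seqLen; pos++) {
    // Mean of this position's d_model components.
    let sum = 0;
    for (let d = 0; d < dModel; d++) sum += input[pos * dModel + d];
    const mean = sum / dModel;
    // Broadcast the mean back across every component of this position.
    for (let d = 0; d < dModel; d++) target[pos * dModel + d] = mean;
  }
  return target;
}
```

For a two-position sample with d_model = 2, the input [1, 3, -1, 1] would map to the target [2, 2, 0, 0] under this reading.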
Two substrates compared in "EML vs baseline" mode:
- EML (SafeTree): each Q/K/V/out projection is a depth-N tree of exp(x) − ln(y) operators. Universal function approximator, weights snap to integers, human-readable trained models. WeftOS's bet for interpretable learned functions.
- Baseline (affine): each projection is a plain W·x + b matrix multiply. Standard transformer math. Fewer parameters, faster, easier to train.
Same training loop, same trial budget, same seed — only the substrate differs. Watch them converge side-by-side.
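For reference, the baseline substrate's W·x + b can be written in a few lines. This is a minimal sketch: the function names `affine` and `affineParamCount` and the nested-array weight layout are assumptions for illustration, not the trainer's actual data structures.

```typescript
// Minimal sketch of an affine projection y = W·x + b.
// Assumed layout: W is rows × cols (one inner array per output row),
// x has length cols, b has length rows.
function affine(W: number[][], x: number[], b: number[]): number[] {
  return W.map((row, i) => row.reduce((acc, w, j) => acc + w * x[j], b[i]));
}

// Parameter count of one such projection: one weight per (row, col) pair
// plus one bias per row.
function affineParamCount(rows: number, cols: number): number {
  return rows * cols + rows;
}
```

With W = [[1, 2], [0, -1]], x = [3, 4] and b = [0.5, 0], the sketch returns [11.5, -4].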
Configuration
These knobs change the shape of the attention block and the training budget. d_model is the input embedding dimension per token; seq_len is the number of tokens per sample; d_k is the Q/K/V projection width; depth is the EML tree depth (ignored by the affine baseline). rounds and samples control how long training runs. seed makes runs reproducible — same seed, same trajectory.
Parameter count at this shape: 2848. The trainer only optimizes out_model per the Iteration-0 spec; Q/K/V stay at their deterministic seeded init.
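The knobs above can be pictured as a small config object plus a seeded random number generator. The sketch below is an assumption-laden illustration: the interface name `TrainerConfig`, the example values, and the choice of a mulberry32-style generator are all hypothetical (the page does not state which generator it uses, and the example values are not the shape that yields the parameter count quoted above). It only demonstrates why a fixed seed gives a repeatable batch of inputs in [-1, 1].

```typescript
// Hypothetical config shape; field names follow the knobs listed above.
interface TrainerConfig {
  d_model: number;
  seq_len: number;
  d_k: number;
  depth: number;
  rounds: number;
  samples: number;
  seed: number;
}

// mulberry32-style 32-bit seeded PRNG returning values in [0, 1).
// Assumption: chosen for illustration only, not the trainer's generator.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Draw `samples` flattened seq_len × d_model inputs with values in [-1, 1).
function sampleInputs(cfg: TrainerConfig): number[][] {
  const rand = mulberry32(cfg.seed);
  const batch: number[][] = [];
  for (let s = 0; s < cfg.samples; s++) {
    const flat: number[] = [];
    for (let i = 0; i < cfg.seq_len * cfg.d_model; i++) flat.push(rand() * 2 - 1);
    batch.push(flat);
  }
  return batch;
}

// Illustrative values only.
const exampleConfig: TrainerConfig = {
  d_model: 4, seq_len: 3, d_k: 4, depth: 2, rounds: 10, samples: 8, seed: 42,
};
```

Calling `sampleInputs(exampleConfig)` twice yields identical batches, which is the property that makes a seeded run repeatable.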
Train
Pick a mode, then click Train. EML vs baseline runs both substrates back-to-back and shows a side-by-side comparison — this is the most informative mode and the default. The single-mode options run only one substrate so you can study its behavior in isolation.
Export Rust-loadable JSON
Each download contains the full weights for one submodel in a shape that round-trips with eml_core::EmlModel. You can train offline in the browser, commit the JSON to your repo, and EmlModel::from_json loads them directly.
Loading in Rust
use eml_core::EmlModel;
let q_json = std::fs::read_to_string("eml_attention_toy_q.json")?;
let q = EmlModel::from_json(&q_json).unwrap();
// same for k, v, out, then plug into ToyEmlAttention
What's happening under the hood
Forward pass — the attention math: every time the trainer evaluates MSE, the block runs this sequence:
- Project to Q, K, V. The flattened input (seq_len × d_model) goes into three separate projection layers. Each produces a seq_len × d_k tensor. Q is what each position is "asking about", K is what each position "advertises", V is what each position will contribute if attended to.
- Score: for every pair of positions (i, j), compute scoresᵢⱼ = (Qᵢ · Kⱼ) / √dₖ. Bigger score means position i wants more of position j's value.
- Softmax: turn each row of scores into a probability distribution (rows sum to 1). This is the attention pattern: the dynamic, input-dependent routing.
- Blend V: compute contextᵢ = Σⱼ attnᵢⱼ · Vⱼ. Each position now contains a weighted blend of every position's value.
- Output projection: project context back to d_model dimensions per position. This is the block's output.
Real transformers stack many of these blocks (12 in BERT-base, 96 in GPT-3) with residual connections, layer norm, and a feed-forward network between attention layers. We're showing one block in isolation.
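The five forward-pass steps listed above can be sketched in TypeScript as follows. This is an illustrative sketch, not the trainer's code: the `Projection` type, the function names, and the flattened row-major layout are assumptions, and the projections are passed in as plain functions so that either substrate could be plugged in.

```typescript
// Assumed abstraction: a projection maps one flattened tensor to another.
type Projection = (flat: number[]) => number[];

// Softmax over one row of scores; the max is subtracted for numerical stability.
function softmaxRow(row: number[]): number[] {
  const max = Math.max(...row);
  const exps = row.map((s) => Math.exp(s - max));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
}

function attentionForward(
  input: number[], // flattened seq_len × d_model
  seqLen: number,
  dK: number,
  q: Projection,
  k: Projection,
  v: Projection,
  out: Projection,
): { attn: number[][]; output: number[] } {
  // Step 1: project to Q, K, V (each assumed flattened seq_len × d_k, row-major).
  const Q = q(input);
  const K = k(input);
  const V = v(input);
  const scale = Math.sqrt(dK);
  const attn: number[][] = [];
  const context = new Array<number>(seqLen * dK).fill(0);
  for (let i = 0; i < seqLen; i++) {
    // Step 2: scaled dot-product score of position i against every position j.
    const scores: number[] = [];
    for (let j = 0; j < seqLen; j++) {
      let dot = 0;
      for (let c = 0; c < dK; c++) dot += Q[i * dK + c] * K[j * dK + c];
      scores.push(dot / scale);
    }
    // Step 3: softmax turns the row into attention weights that sum to 1.
    const weights = softmaxRow(scores);
    attn.push(weights);
    // Step 4: blend V rows using those weights.
    for (let j = 0; j < seqLen; j++) {
      for (let c = 0; c < dK; c++) context[i * dK + c] += weights[j] * V[j * dK + c];
    }
  }
  // Step 5: output projection back to d_model values per position.
  return { attn, output: out(context) };
}
```

If the Q projection returns all zeros, every score is 0, each attention row is uniform, and each context row is the plain average of the V rows. That makes a convenient sanity check.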
Training — what the optimizer is actually doing:
- Pick a random parameter index across the union of all layers.
- Generate a perturbation δ from an annealing schedule (big early, small late).
- Apply δ, run the forward pass on a 16-sample subset, measure MSE.
- If MSE is lower than the previous best, keep the change. Else revert.
- Repeat for the full trial budget. The fraction of accepted perturbations decreases as the model converges — early on, easy wins; late on, you need precise small steps to make progress.
This is random coordinate descent with an annealed step size. It resembles simulated annealing in its shrinking perturbations, but it is strictly greedy: a change that raises MSE is always reverted. It's one of the simplest gradient-free optimizers that still works on hundreds of parameters. Real LLMs use Adam (gradient-based), which is orders of magnitude more sample-efficient but requires backprop, which EML deliberately avoids because the exp(x) − ln(y) tree is hard to differentiate cleanly.
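The training loop described above can be sketched as a short function. Several details are assumptions made for illustration: the names `coordinateDescent` and `lcg`, the geometric step-size schedule from 1.0 down to 0.01, the uniform draw for δ, and the toy quadratic loss that stands in for "forward pass on a 16-sample subset, then MSE".

```typescript
// Tiny seeded generator so the example is repeatable (illustrative only).
function lcg(seed: number): () => number {
  let s = seed >>> 0;
  return () => {
    s = (Math.imul(s, 1664525) + 1013904223) >>> 0;
    return s / 4294967296;
  };
}

// Greedy random coordinate descent. `loss` abstracts the forward pass + MSE.
function coordinateDescent(
  params: number[],
  loss: (p: number[]) => number,
  trials: number,
  rand: () => number,
): { best: number; accepted: number } {
  let best = loss(params);
  let accepted = 0;
  for (let t = 0; t < trials; t++) {
    // Pick one random parameter index.
    const idx = Math.floor(rand() * params.length);
    // Assumed schedule: step size decays geometrically from 1.0 to 0.01.
    const stepSize = Math.pow(0.01, t / trials);
    const delta = (rand() * 2 - 1) * stepSize;
    const previous = params[idx];
    params[idx] = previous + delta;
    const candidate = loss(params);
    if (candidate < best) {
      best = candidate; // keep the change
      accepted++;
    } else {
      params[idx] = previous; // revert
    }
  }
  return { best, accepted };
}

// Toy usage: fit three weights to fixed goal values.
const goal = [0.5, -0.25, 1.0];
const toyLoss = (p: number[]) =>
  p.reduce((acc, v, i) => acc + (v - goal[i]) * (v - goal[i]), 0) / p.length;
const toyWeights = [0, 0, 0];
const toyInitialLoss = toyLoss(toyWeights);
const toyResult = coordinateDescent(toyWeights, toyLoss, 2000, lcg(7));
```

Because a change is kept only when the loss drops, `toyResult.best` can never exceed `toyInitialLoss`. The price is that every trial costs a full loss evaluation, which is why this approach suits only small parameter counts.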
What MSE means here: Mean Squared Error is the average of (prediction − target)² across all output dimensions and all samples. MSE = 0 means perfect prediction. MSE = 1.0 means a root-mean-square error of 1 unit per dimension, which is large on this task because inputs lie in [-1, 1].
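A minimal sketch of that computation, assuming predictions and targets are stored as one flattened array per sample (the function name `meanSquaredError` is chosen for this example):

```typescript
// Average of (prediction − target)² over every output dimension of every sample.
// Assumes both arguments have the same shape: one flattened array per sample.
function meanSquaredError(predictions: number[][], targets: number[][]): number {
  let sum = 0;
  let count = 0;
  for (let s = 0; s < predictions.length; s++) {
    for (let d = 0; d < predictions[s].length; d++) {
      const err = predictions[s][d] - targets[s][d];
      sum += err * err;
      count++;
    }
  }
  return count === 0 ? 0 : sum / count;
}
```

For a single sample with prediction [2, 0] and target [0, 0], the squared errors are 4 and 0, so the MSE is 2 and the root-mean-square error is about 1.41.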
Why two substrates?
EML (Exp-Minus-Log) replaces every multiply with a tree of eml(x, y) = exp(x) − ln(y) operations. Combined with the constant 1, this single operator can reconstruct any elementary mathematical function (Odrzywolel 2026, arXiv:2603.21852). Trained EML weights snap to exact integers, making the model human-readable as a closed-form formula. WeftOS has 12 production EML models doing scalar control-plane decisions.
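To make the operator concrete, the sketch below defines eml and checks a few identities that follow directly from its definition using only the constant 1. The helper names are chosen for this example, and the identities are ones that can be verified by hand; they are not presented as the specific constructions used in the cited paper or in WeftOS.

```typescript
// The single EML operator: eml(x, y) = exp(x) − ln(y), defined for y > 0.
function eml(x: number, y: number): number {
  return Math.exp(x) - Math.log(y);
}

// eml(1, 1) = exp(1) − ln(1) = e.
const eulerViaEml = eml(1, 1);

// eml(x, 1) = exp(x) − ln(1) = exp(x).
function expViaEml(x: number): number {
  return eml(x, 1);
}

// ln(x) for x > 0, using only eml and the constant 1:
//   eml(1, x)            = e − ln(x)
//   eml(e − ln(x), 1)    = exp(e − ln(x)) = e^e / x
//   eml(1, e^e / x)      = e − (e − ln(x)) = ln(x)
function lnViaEml(x: number): number {
  return eml(1, eml(eml(1, x), 1));
}
```

In floating point these compositions agree with `Math.exp` and `Math.log` only up to rounding error, so comparisons should use a small tolerance.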
Baseline is just W·x + b — the standard transformer projection. No tree, no transcendental operators, just dot products. This is what every published transformer uses.
The honest comparison at toy scale (see results above): plain affine attention converges much faster, runs ~3× faster on forward pass, and uses ~2× fewer parameters than EML. What EML buys at this scale is interpretability — the trained model is a closed-form expression, not an opaque weight tensor. Whether that tradeoff is worth it depends on what you're building. For sequence-modeling backbones, baseline wins. For control-plane scalar decisions (the existing 12 WeftOS wrappers), EML wins decisively.
This is research, not production. The real finding from this experiment isn't "use EML attention" — it's "we now know EML's sweet spot ends around vector-output sequence work, and starts paying off again as scalar interpretable functions". See the EML Attention docs for the full reasoning and the upcoming "EML at the control-plane boundary" design pattern.