## How it works

```
┌─────────────┐     ┌─────────────┐     ┌──────────────┐
│  LLM reads  │────▶│  Proposes   │────▶│ Train model  │
│  history +  │     │ next batch  │     │  on GPU/MPS  │
│  curves     │     │ of configs  │     │              │
└─────────────┘     └─────────────┘     └──────┬───────┘
       ▲                                       │
       │            ┌─────────────┐            │
       └────────────│ Log results │◀───────────┘
                    │   (JSONL)   │
                    └─────────────┘
```
Each iteration:
- Build a prompt with the search space, best result, last 20 experiments, and per-epoch training curves
- Ask the LLM to propose a batch of configs as JSON
- Validate and clamp the response (fall back to random on any parse failure)
- Train each config, tracking per-epoch train loss, val loss, and val accuracy
- Log everything to JSONL, update best
- Repeat
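The prompt-building step above can be sketched roughly as follows. All function and field names here are illustrative, not neuropt's actual API:

```python
def build_prompt(space, best, history, max_history=20):
    """Assemble the tuning prompt: the search space, the best result so far,
    and the most recent experiments with their per-epoch curves.

    `space`, `best`, and the `config`/`curve` fields are assumed shapes,
    not neuropt's real data model.
    """
    lines = [f"Search space: {space}", f"Best so far: {best}"]
    for exp in history[-max_history:]:          # last N experiments only
        lines.append(f"config={exp['config']}")
        for ep, (tr, va, acc) in enumerate(exp["curve"], 1):
            lines.append(f"  ep{ep}: train={tr:.2f} val={va:.2f} acc={acc:.2f}")
    lines.append("Propose the next batch of configs as a JSON list.")
    return "\n".join(lines)
```

The per-epoch lines are formatted the same way they appear in the examples below, so the LLM sees the full loss trajectory rather than a single score.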
## What the LLM sees
Most tuning tools give the optimizer a single number: "this config scored 0.85."
neuropt shows the full picture:
```
lr=0.05, activation=relu, use_residual=False:
  ep1: train=2.30 val=2.28 acc=0.12
  ep2: train=1.45 val=1.52 acc=0.41
  ep3: train=0.82 val=1.35 acc=0.53
  ep4: train=0.31 val=1.61 acc=0.48   ← val rising = overfitting
  ep5: train=0.09 val=1.89 acc=0.45

lr=8.8e-4, activation=gelu, use_residual=True:
  ep1: train=1.92 val=1.85 acc=0.28
  ep2: train=1.01 val=0.98 acc=0.62
  ep3: train=0.62 val=0.71 acc=0.74
  ep4: train=0.41 val=0.52 acc=0.81   ← both dropping = good fit
  ep5: train=0.33 val=0.43 acc=0.85
```
The system also pre-computes signals like `OVERFITTING: train 2.30→0.09, val 1.52→1.89, gap=1.80` so the LLM doesn't have to do the math.
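A minimal sketch of that diagnosis step, assuming the curves are plain lists of floats. The thresholds here are illustrative guesses, not neuropt's actual values:

```python
def diagnose(train_losses, val_losses):
    """Summarize a run's curves into a one-line signal for the prompt.

    Heuristic: if validation loss has climbed off its minimum while the
    train/val gap is large, flag overfitting. Thresholds are assumptions.
    """
    gap = val_losses[-1] - train_losses[-1]
    val_rising = val_losses[-1] > min(val_losses)
    if val_rising and gap > 0.5:
        return (f"OVERFITTING: train {train_losses[0]:.2f}->{train_losses[-1]:.2f}, "
                f"val {min(val_losses):.2f}->{val_losses[-1]:.2f}, gap={gap:.2f}")
    return f"OK: gap={gap:.2f}"
```

Emitting the signal as text keeps the LLM focused on reasoning about the cause rather than recomputing deltas from raw numbers.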
## Why this works better than Bayesian optimization
Bayesian methods (Optuna TPE, Gaussian processes) build a statistical surrogate from (config, score) pairs. They learn correlations but have no domain knowledge — they don't know that high learning rates cause oscillation, or that dropout hurts underfitting models.
The LLM starts with ML knowledge and reads the training curves to understand why something worked or didn't. It can reason about interactions: "this config overfits because the model is too big for this dataset and has no regularization" rather than just "this config scored poorly."
## Fallback behavior
If the LLM returns invalid JSON, times out, or the API is down, the system silently generates random configs from the search space. A fallback counter tracks how often this happens. No exceptions propagate — the search continues no matter what.
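The fallback path can be sketched like this. The names (`propose_batch`, `stats["fallbacks"]`, the search-space sample) are hypothetical, shown only to illustrate the catch-everything-and-continue shape:

```python
import json
import random

def random_config():
    """Illustrative random sample; the real search space lives elsewhere."""
    return {"lr": 10 ** random.uniform(-4, -1)}

def parse_configs(raw):
    batch = json.loads(raw)                   # raises ValueError on bad JSON
    if not isinstance(batch, list):
        raise ValueError("expected a JSON list")
    return batch

def propose_batch(llm_call, batch_size, stats):
    """Ask the LLM for configs; on any failure, silently fall back to random."""
    try:
        configs = parse_configs(llm_call())   # llm_call may raise on timeout/API error
        if not configs:
            raise ValueError("empty batch")
        return configs[:batch_size]
    except Exception:                         # invalid JSON, timeout, API down
        stats["fallbacks"] += 1               # counter tracks fallback frequency
        return [random_config() for _ in range(batch_size)]
```

Swallowing every exception is a deliberate trade-off: a dropped LLM call costs one batch of random exploration, whereas a propagated exception would kill a long-running search.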
## LLM backends
Auto-detected in order:
| Backend | How | Overhead |
|---|---|---|
| Claude (Haiku 4.5) | `ANTHROPIC_API_KEY` env var | ~1s/call, cheapest option |
| OpenAI | `OPENAI_API_KEY` env var | ~1-2s/call |
| Local Qwen 2.5 1.5B | `--backend qwen` | ~3s/call, runs on CPU |
| Random search | `--backend none` | Baseline comparison |