Auto Model Introspection
This is all under the hood
You don't need to understand any of this to use neuropt. Just call ArchSearch.from_model(model, train_fn) and it handles everything. This page is for people who want to know what's happening inside.
With ArchSearch.from_model(), you don't specify any hyperparameters or search ranges — neuropt walks your model's module tree, discovers what's tunable, and builds a search space for you.
from neuropt import ArchSearch
model = build_my_cnn() # or load a pretrained model
def train_fn(config):
m = config["model"].to("cuda") # deep copy with modifications applied
optimizer = torch.optim.Adam(m.parameters(), lr=config["lr"])
# ... train ...
return {"score": val_loss, "train_losses": [...], "val_losses": [...]}
search = ArchSearch.from_model(model, train_fn, backend="claude")
search.run(max_evals=50)
neuropt prints what it found and what it will search over:
Introspected PyTorch model (77,322 params):
Activations: ReLU, Mish (2 layers)
Dropout: 3 layers (rate=0.20)
BatchNorm: 4 layers
Pooling: 1 layers (current: avg)
Search space: ['activation', 'dropout', 'use_batchnorm', 'pool_type',
'lr', 'wd', 'optimizer']
Your original model is never modified — each experiment gets a deep copy with the LLM's proposed changes applied.
How it works
- Walk the module tree —
model.named_modules()finds every layer - Classify each module — activations, dropout, norms, pooling, etc.
- Build a search space — each detected component becomes a tunable parameter
- Wrap your train_fn — deep-copies the model, applies config, passes it as
config["model"] - Detect pretrained weights — if the model looks pretrained, add fine-tuning strategies (see Fine-Tuning)
The LLM gets context about what was detected ("4 BatchNorm layers", "attention and FF dropout found separately") and domain-specific guidance so it can reason about why to try a change, not just guess.
What gets detected
| Component | What neuropt finds | Search param |
|---|---|---|
| Activations | ReLU, GELU, SiLU, Mish, Hardswish, PReLU, LeakyReLU |
activation (7-way swap) |
| Dropout | Dropout, Dropout1d/2d/3d, AlphaDropout, FeatureAlphaDropout |
dropout (rate) |
| MHA dropout | MultiheadAttention internal .dropout float |
mha_dropout (rate) |
| Per-path dropout | Groups by path name (attn, ff, mlp, embed) |
dropout_attn, dropout_ff, etc. |
| BatchNorm | BatchNorm1d/2d/3d |
use_batchnorm (on/off) |
| LayerNorm | LayerNorm |
use_layernorm (on/off) |
| Pooling | AdaptiveAvgPool, AdaptiveMaxPool |
pool_type (avg/max/attention) |
| Training | Always included | lr, wd, optimizer |
If pretrained weights are detected, additional fine-tuning params are added — see Fine-Tuning Pretrained Models.
Activations
All detected activation layers are swapped together. The 7 options:
| Activation | Notes |
|---|---|
relu |
Standard default |
gelu |
Common in transformers, often better than ReLU |
silu |
Smooth, used in EfficientNet and modern CNNs |
mish |
Competitive with SiLU, used in YOLOv4/v5 |
hardswish |
Cheaper SiLU approximation, good for mobile/edge |
leaky_relu |
Avoids dead neurons |
prelu |
Learnable negative slope (adds a small number of parameters) |
Dropout
neuropt detects all PyTorch dropout variants — Dropout, Dropout1d, Dropout2d, Dropout3d, AlphaDropout, and FeatureAlphaDropout. The rate is tuned as a single dropout param.
Per-path dropout — for transformers with dropout in distinct paths, neuropt groups by module path name and emits separate params so the LLM can tune attention and feedforward dropout independently:
- Path contains
attn/attention→dropout_attn - Path contains
ff/ffn/mlp→dropout_ff - Path contains
embed→dropout_embed - Falls back to single
dropoutif only one group exists
MHA internal dropout — nn.MultiheadAttention stores dropout as a plain float attribute, not a module. neuropt catches this and adds mha_dropout as a separate search param.
Normalization
BatchNorm and LayerNorm can each be toggled off (replaced with nn.Identity()).
- BatchNorm off — sometimes helps very small models or when batch sizes vary widely
- LayerNorm off — rarely helpful in transformers, but the LLM can test it
Pooling
When AdaptiveAvgPool or AdaptiveMaxPool layers are found, neuropt adds a three-way swap:
| Pool type | How it works |
|---|---|
avg |
Average over all spatial positions (standard default) |
max |
Keep the strongest activation per channel |
attention |
Learned weighted average — a small linear projection computes attention weights over spatial positions (~C extra params) |
Attention pooling is useful when not all spatial positions are equally important for the task — it learns where to look.