Fine-Tuning Pretrained Models

This is all under the hood

neuropt handles fine-tuning detection and strategy selection automatically. Just pass a pretrained model to from_model() — no configuration needed. This page explains what's happening inside.

When from_model() detects pretrained weights, it adds fine-tuning strategies to the search space alongside the standard introspection params. The LLM uses domain knowledge about what works for fine-tuning — freeze strategies, learning rate decay, regularization toward pretrained weights — and decides what to try based on your training curves.

Quick start

import torchvision
from neuropt import ArchSearch

model = torchvision.models.resnet18(weights="DEFAULT")

def train_fn(config):
    m = config["model"].to("cuda")
    optimizer = torch.optim.AdamW(
        [p for p in m.parameters() if p.requires_grad],
        lr=config["lr"],
    )
    # ... train ...
    return {"score": val_loss, "train_losses": [...], "val_losses": [...]}

search = ArchSearch.from_model(model, train_fn, backend="claude")
search.run(max_evals=50)

Introspected PyTorch model (11,689,512 params):
  Activations: ReLU (9 layers)
  BatchNorm: 20 layers
  Pooling: 1 layers (current: avg)
  Pretrained: yes (fine-tuning strategies enabled)
  Head: fc
  Search space: ['activation', 'dropout', 'use_batchnorm', 'pool_type', 'lr',
    'wd', 'optimizer', 'freeze_strategy', 'lr_layer_decay', 'l2sp_regularization']

How pretrained detection works

neuropt compares each parameter's variance to what PyTorch's default initialization would produce. Pretrained models that have been trained with weight decay have significantly lower variance than random init. This is automatic — no labels or metadata needed.

Override if the heuristic gets it wrong:

search = ArchSearch.from_model(model, train_fn, pretrained=True)   # force on
search = ArchSearch.from_model(model, train_fn, pretrained=False)  # force off

neuropt also detects layer groups (repeating numbered blocks like layer1.0, layer1.1) and the classification head (last nn.Linear), which are used by the freeze strategies below.

Fine-tuning search params

These three params are only added when pretrained weights are detected.

`freeze_strategy`

Controls which layers are trainable. The LLM is given guidance about when each works best.

Strategy	What it does	When to use
`full`	Train everything (no-op)	Large datasets, no forgetting risk
`head_only`	Freeze all, train only the classification head	Small datasets, safest option
`gradual_unfreeze`	Freeze all, unfreeze last ~1/3 of layer groups + head	Good default for medium datasets
`all_but_embeddings`	Freeze only `nn.Embedding` layers	NLP models where embeddings are well-trained

`lr_layer_decay`

Float in [0.5, 1.0]. Passed through as config["lr_layer_decay"] for you to implement layer-wise learning rate decay (LLRD) in your training loop.

Near 1.0 = uniform LR across all layers
Near 0.5 = aggressive decay (early layers learn much slower)

This is a pass-through value — neuropt puts it in config, you apply it:

def train_fn(config):
    m = config["model"]
    decay = config.get("lr_layer_decay", 1.0)
    param_groups = []
    for i, (name, param) in enumerate(m.named_parameters()):
        if param.requires_grad:
            layer_lr = config["lr"] * (decay ** i)
            param_groups.append({"params": [param], "lr": layer_lr})
    optimizer = torch.optim.AdamW(param_groups)
    # ...

`l2sp_regularization`

Boolean. When True, neuropt snapshots the pretrained weights before training and injects them as config["pretrained_weights"] — a dict of {name: tensor}. Use these to regularize toward pretrained weights instead of zero (L2-SP), which prevents catastrophic forgetting.

def train_fn(config):
    m = config["model"].to("cuda")
    # ... forward pass ...
    loss = task_loss
    if "pretrained_weights" in config:
        l2sp_loss = sum(
            ((p - config["pretrained_weights"][n].to(p.device)) ** 2).sum()
            for n, p in m.named_parameters() if p.requires_grad
        )
        loss = loss + 0.01 * l2sp_loss
    # ...

What the LLM knows

When a pretrained model is detected, the LLM's context includes fine-tuning-specific guidance:

head_only is safest for small datasets
gradual_unfreeze is a good default for medium datasets
full is best with enough data but risks catastrophic forgetting
lr_layer_decay near 0.5 means early layers barely update
L2-SP prevents forgetting — worth trying if full fine-tuning overfits

This lets the LLM reason about your training curves in the context of fine-tuning: "val loss is rising with full strategy → try head_only or increase L2-SP" rather than generic overfitting advice.

Early results

In a 10-eval benchmark fine-tuning ViT-B/16 on CIFAR-10, neuropt found a config achieving 0.142 val loss / 95.7% accuracy — the LLM quickly converged on last_blocks freeze + AdamW + lr=3e-4. Optuna's best in the same budget was 0.195. The LLM's domain knowledge about freeze strategies gave it an edge that statistical surrogates can't match.