Fine-Tuning a Large Vision Model Without Touching Most of Its Weights

Two things make fine-tuning large models painful. The first is compute — storing gradients for tens of millions of parameters is expensive. The second is: the more you update, the more you risk losing what the model already knew. I ran into both when working on SAM for medical image segmentation. That's what pushed me to look at parameter-efficient fine-tuning, and specifically at LoRA — a method that lets you adapt a large pretrained model by only training a small fraction of it, while keeping everything else exactly as it was.

This post walks through what LoRA is, how it works, and how I used it — the concept, the math, and the implementation choices, including the ones where the reasoning wasn't obvious to me at first.

Why full fine-tuning is often the wrong move

SAM — Segment Anything Model — is a foundation model released by Meta in 2023. Give it an image and a prompt (a click, a box, or a rough mask) and it segments whatever you're pointing at. It was trained on over a billion segmentation masks across natural images, which gives it a strong general understanding of visual boundaries and object structure. That generality is exactly what makes it useful as a starting point — and exactly what makes naive fine-tuning on a small medical dataset risky.

When you have a pretrained model and want to adapt it to a new task, the instinct is to fine-tune everything. Update all the weights on your new data, let the model adjust.

The problem is that "update everything" carries a cost that scales directly with model size. SAM's image encoder has tens of millions of parameters. Updating all of them means computing and storing gradients for all of them at every training step — expensive in memory, slow on a single GPU, and often not feasible without infrastructure most people don't have access to.

But the deeper problem is what happens to the model's existing knowledge. SAM was trained on over a billion segmentation masks. It has a strong understanding of edges, boundaries, shapes, and what makes something a distinct object. If you fully fine-tune on a small domain-specific dataset — say, a thousand colonoscopy images — you can overwrite that general knowledge with something overfit to your narrow distribution. You gain task-specificity and lose generality, and on a small dataset, that trade is usually bad.

What you actually want is to keep everything the pretrained model already knows and add only the adjustments your task requires. That's the problem LoRA solves.

What LoRA is

LoRA — Low-Rank Adaptation — was introduced in a 2021 paper by Hu et al. Originally designed for language models, it works just as well on vision transformers.

The starting observation is this: when you fine-tune a pretrained model, the changes to its weight matrices tend to be low-rank. You don't need to update all the values in a weight matrix to meaningfully adapt the model. The useful part of the update can be expressed as something much smaller.

Here's what that means in practice.

A linear layer computes:

text
output = W · input

W is a weight matrix. In a 768-dimensional transformer, that's 768×768 — around 590,000 values.

LoRA leaves W frozen and adds a small update alongside it:

text
output = W · input  +  (B · A) · input

Where:

→A has shape [rank × in_features]
→B has shape [out_features × rank]
→rank is a small number, typically 4, 8, or 16

B · A has the same shape as W, so the addition works dimensionally. But rather than 590,000 trainable values, you have:

→A: rank × 768 values
→B: 768 × rank values

At rank 4 that's 6,144 total — about 96× fewer than the full matrix. You only train those. W doesn't move.

The reason this is enough is that the adaptation you need is genuinely low-rank. You're not teaching the model a completely new concept. You're shifting how it weights existing features toward your domain. That kind of adjustment lives in a low-dimensional subspace of the full weight space.

Initialisation

How you initialise A and B matters more than it might seem.

B starts at zero. A gets standard random initialisation. Because B · A = 0 at the start, the LoRA term contributes nothing at step zero — the model behaves exactly like the pretrained model at the beginning of training. Adaptation grows from there.

If you initialised both randomly, you'd be starting from a broken model and trying to recover, which makes training unstable.

python
nn.init.kaiming_uniform_(self.lora_A.weight, a=math.sqrt(5))  # standard random init for A
nn.init.zeros_(self.lora_B.weight)                              # B starts at zero

There's also a scaling factor applied to the LoRA output: alpha / rank. The purpose is simple — it controls how much the LoRA update is allowed to influence the output relative to the frozen base. Without it, the adapter could push the output too hard early in training, overriding the pretrained weights before it's learned anything useful.

alpha is a number you set, and rank is the size you chose for your matrices. The reason you have both instead of just one scalar is this: when you increase rank, you're adding more parameters — but each individual parameter ends up doing less per unit of rank, because the capacity is spread across more dimensions. Dividing by rank corrects for that spread, so the overall strength of the LoRA update stays roughly the same regardless of what rank you pick. In practice it means you can change rank without having to re-tune alpha from scratch.

I used alpha=1.0 and rank=4, which gives a scaling of 0.25 — the LoRA term contributes at a quarter the weight of the frozen output. That's a conservative choice. I didn't want the adapter pulling the model away from SAM's pretrained behaviour too aggressively, especially early in training when the LoRA weights are still random. If you're starting from a weaker pretrained model or the domain gap is larger, you'd increase alpha. For adapting SAM specifically, 1.0 is a reasonable default to start from.

Where to inject it

A transformer block contains an attention mechanism with four linear projections: query, key, value, and output. These are the layers that determine what the model pays attention to and how it combines information across positions in the image. This is where most of the domain-sensitive behaviour lives, and most LoRA work targets these specifically.

In my setup I injected into all linear layers in the image encoder, not just the attention projections. The reason is straightforward: transformer blocks have two parts — the attention mechanism and the MLP layers (feed-forward networks that transform features after each attention step) that come after it. The MLP layers are where the model actually transforms features between attention steps, not just routes information. They hold just as much domain-specific behaviour as the attention projections. Injecting only into attention and leaving the MLPs frozen means half the computation is stuck with natural-image priors it can't change.

The tradeoff is more trainable parameters, but it's still a small fraction of the total. If memory is very tight, attention-only works as a starting point — but if you have the room, covering all linear layers gives the adaptation more surface area to work with.

SAM's architecture — and applying LoRA to it

SAM has three main components. The image encoder is a Vision Transformer (ViT) — a model architecture that processes images by splitting them into fixed patches and running attention across those patches, similar to how language transformers process words. It's the heavyweight part of the model: most of the parameters live here, most of the visual understanding happens here, and this is where the domain-specific knowledge needs to change.

The prompt encoder takes the user's input — a click point, a bounding box, or a mask hint — and converts it into embeddings (numerical representations the model can work with) that tell the decoder where to focus. It's relatively lightweight and doesn't need much adaptation for a new domain; the prompt format stays the same.

The mask decoder is a lightweight transformer that takes the image embedding and the prompt embedding, runs cross-attention (a mechanism that lets the prompt "ask questions" of the image features — figuring out which parts of the image the prompt is pointing to) between them, and produces the final segmentation mask. The design philosophy of SAM is that the image encoder does most of the heavy lifting, and the decoder's job is to translate that embedding into a mask given the prompt.

The mismatch with medical images is deeper than just a distribution shift. Natural images have objects with clear semantic identity — a dog is a dog, a chair is a chair — and SAM's encoder learned to find those objects based on high-contrast boundaries, familiar textures, and recognisable shapes. Colonoscopy frames don't have any of that. A polyp is a subtle elevation of mucosa against similar-looking surrounding tissue. Boundaries are soft. Colour variation is minimal. What matters is local texture microstructure and faint gradient transitions that wouldn't register as significant in a natural-image context.

SAM's encoder isn't wrong about what it learned. It just learned to care about the wrong things for this task. LoRA gives it a way to relearn those weights without discarding the underlying structural knowledge that's still useful — the understanding of spatial relationships, boundary detection, feature hierarchy — while teaching it to respond differently to the specific visual patterns that matter in colonoscopy.

This is where LoRA goes in. The image encoder has hundreds of linear projections across its transformer blocks — query, key, value, and output projections in each attention head, plus two linear layers in each MLP block. Freeze the whole model, inject LoRA adapters into all of those, and only those adapters train.

Here's the wrapper that replaces each linear layer:

python
class LoRALinear(nn.Module):
    def __init__(self, original_linear, rank=4, alpha=1.0, dropout=0.05):
        super().__init__()
        self.in_features = original_linear.in_features
        self.out_features = original_linear.out_features
        self.scaling = alpha / rank

        # Original weights — frozen
        self.linear = original_linear
        self.linear.weight.requires_grad = False
        if self.linear.bias is not None:
            self.linear.bias.requires_grad = False

        # LoRA matrices — these are what we train
        self.lora_A = nn.Linear(self.in_features, rank, bias=False)
        self.lora_B = nn.Linear(rank, self.out_features, bias=False)
        self.lora_dropout = nn.Dropout(dropout) if dropout > 0 else nn.Identity()

        nn.init.kaiming_uniform_(self.lora_A.weight, a=math.sqrt(5))
        nn.init.zeros_(self.lora_B.weight)

    def forward(self, x):
        base_out = self.linear(x)
        lora_out = self.lora_B(self.lora_A(self.lora_dropout(x))) * self.scaling
        return base_out + lora_out

And the injection — walks the model tree, finds every nn.Linear, replaces it with the wrapper:

python
def inject_lora(model, rank=4, alpha=1.0, dropout=0.05):
    targets = []
    for name, module in model.named_modules():
        for child_name, child in module.named_children():
            if isinstance(child, nn.Linear):
                targets.append((module, child_name, child))

    for parent, child_name, child in targets:
        setattr(parent, child_name, LoRALinear(child, rank, alpha, dropout))

Applied to SAM:

python
from mobile_sam import sam_model_registry

sam = sam_model_registry["vit_t"](checkpoint="mobile_sam.pt")

# Freeze everything
for p in sam.parameters():
    p.requires_grad = False

# Inject LoRA into the image encoder
inject_lora(sam.image_encoder, rank=4, alpha=1.0, dropout=0.05)

trainable = sum(p.numel() for p in sam.parameters() if p.requires_grad)
total = sum(p.numel() for p in sam.parameters())
print(f"Trainable: {trainable:,} / {total:,} ({100*trainable/total:.1f}%)")
# Trainable: 255,360 / 5,800,000 (4.4%)

4.4% of parameters are trainable. The rest of the model is untouched.

The loss function

Polyp segmentation has a class imbalance problem. A polyp might occupy 3% of the pixels in a frame. A model that always predicts "background" would be 97% accurate and completely useless.

Standard cross-entropy is optimising the wrong thing. You need a loss that cares about overlap quality.

Dice loss measures overlap between the predicted mask and the ground truth directly:

python
def dice_loss(pred_probs, targets, smooth=1.0):
    pred = pred_probs.view(-1)
    target = targets.view(-1)
    intersection = (pred * target).sum()
    return 1 - (2 * intersection + smooth) / (pred.sum() + target.sum() + smooth)

A Dice of 1.0 means perfect overlap. A Dice of 0 means no overlap. Training directly on this means the model is penalised for missing the polyp, regardless of what percentage of pixels it occupies.

Focal loss addresses the same imbalance from a different angle — it down-weights easy examples (confident background predictions) so the gradient signal comes mostly from the hard ones (the polyp region):

python
class FocalLoss(nn.Module):
    def __init__(self, alpha=0.75, gamma=2.0):
        super().__init__()
        self.alpha = alpha
        self.gamma = gamma

    def forward(self, logits, targets):
        bce = F.binary_cross_entropy_with_logits(logits, targets, reduction='none')
        p_t = torch.exp(-bce)
        return (self.alpha * (1 - p_t) ** self.gamma * bce).mean()

I used both combined, plus a boundary-weighted term that adds extra penalty for errors near the mask edge.

How LoRA performed

The main comparison I ran was LoRA-adapted SAM against vanilla SAM (the base model with no adaptation at all), across three random seeds (separate training runs with different random initialisations, to make sure the result isn't a fluke of one particular run):

Mode	Mean Dice
Vanilla SAM (no adaptation)	~0.65
LoRA rank=4	0.927

The gap is significant. Vanilla SAM scores around 0.65 on colonoscopy images, which makes sense — the model was trained entirely on natural images and has no exposure to this visual domain. LoRA adaptation brings it to 0.927 on the training dataset's test split.

The more meaningful test is whether the adaptation generalised. The model was trained on one colonoscopy dataset, then evaluated on three others — different hospitals, different camera equipment, different patient populations, none of which it had seen during training. It held up at 0.85+ Dice across all three. That's a reasonable signal that LoRA was teaching the model to handle the domain in a general way, not just memorising the training distribution.

For context, vanilla SAM on polyp segmentation benchmarks is widely reported in the literature to sit in the 0.60–0.70 Dice range without adaptation, which aligns with what I saw. Published work on SAM fine-tuning for medical segmentation (SAMed, MedSAM, SAM-Med2D) generally reports Dice in the 0.85–0.92 range on similar benchmarks, so the numbers here are within the range of what adapted SAM variants achieve.

Why this matters beyond this task

LoRA changes the practical side of working with large pretrained models in a few concrete ways.

Storage. Without LoRA, adapting a model to a new task means storing a full copy of the updated weights per task. With LoRA you keep one base model and store only the small adapter weights — a few MB instead of hundreds of MB or more. If you're maintaining multiple adaptations of the same model, this becomes a real difference.

Compute during training. Gradients only flow through the LoRA parameters. The frozen weights participate in the forward pass but not backpropagation (the step where the model figures out how to update its weights based on the loss). This cuts memory usage significantly — tasks that would need multi-GPU setups otherwise are often manageable on a single GPU with LoRA.

Swappability. Because the adapters are separate from the base model, you can load different adapters at inference time without reloading the base model. One base model in memory, multiple task-specific adapters swapped as needed.

My thoughts

The thing I keep coming back to is that LoRA reveals something about what fine-tuning is actually doing. The fact that a low-rank update is enough tells you the adaptation you need is genuinely low-dimensional — you're not rewriting the model's understanding of vision, you're shifting which parts of it get activated and how strongly. That's a much smaller operation than it sounds, and it's why you don't need to update everything to get there.

For medical AI this matters practically. Labeled data is expensive to collect, compute budgets are limited, and the range of clinical tasks you might want to cover is wide. LoRA makes it feasible to take one strong general model and adapt it to multiple specific tasks efficiently, without each adaptation being a separate large training job.

I'll keep iterating on this — the next post covers what happened when I took LoRA further and added a frequency-domain branch on top of it.