Machine Learning / 2021 / arXiv

LoRA: Low-Rank Adaptation of Large Language Models

Hu et al.

Full finetuning of a 70B model means storing 70B new weights for every task. LoRA's bet: you don't actually need most of those numbers to change. The useful direction is small enough to fit on a USB stick.


Suppose you have a pretrained language model with 70 billion parameters and you want to teach it to write in your company's voice. The textbook answer is: finetune. Take the model, run gradient descent on your in-house corpus, get a new set of 70 billion weights, deploy.

Now suppose you want to teach it ten different voices for ten different customers. Or fifty. Or you want every user on a phone to have their own personalized model. Now you're storing 70 billion parameters per copy, paying GPU hours to train each one, and shipping a 140-gigabyte file every time someone asks for a small adjustment. The economics of full finetuning fall apart somewhere around the third task.

LoRA's claim — and it's a claim about empirical fact, not just a clever trick — is that the change you need to make to the weights for a specific task is much, much smaller than the weights themselves. Most of the 70 billion parameters don't need to move. The useful task-specific update lives in a tiny subspace, and you can train just that subspace without touching the rest.

The version of this idea that ate the field is exactly four lines of math: freeze W, set ΔW = BA with two skinny matrices, learn B and A, ignore the rest. The 2021 paper by Hu and collaborators is short. The downstream consequences — QLoRA, DoRA, mixture-of-LoRAs, adapter routing, on-device personalisation — are still rippling out.

Before LoRA, the menu of options for cheap finetuning was thin and unsatisfying. Adapter layers (Houlsby et al. 2019) inserted small bottleneck modules between transformer blocks; they worked, but added inference latency that compounded across layers. Prompt tuning (Lester et al. 2021) prepended learned soft tokens to the input; cheap, but capped in expressivity and awkward to batch. BitFit (Ben Zaken et al. 2021) only updated bias terms; surprisingly effective on small models, brittle at scale. Each of these had a real downside that kept it from becoming the default. LoRA didn't have any of those downsides, and that's most of why it took over.

What 'low rank' actually means here

Pick a single weight matrix in the model. Say W is the d × k projection inside one attention layer. With d = k = 4096 (typical for a mid-size model), that's 16 million numbers in this one matrix. Full finetuning would adjust all of them: it would learn a delta matrix ΔW of the same 4096 × 4096 shape, and the new weights would be W + ΔW.

LoRA says: don't learn ΔW directly. Instead, factor it as the product of two skinny matrices. Let B be d × r and A be r × k, where r is small — maybe 8, maybe 16. Define ΔW = B A. The product is still d × k, so it can be added to W exactly as before. But the parameters you train are just B and A.

Count: with r = 8, B has 4096 × 8 = 32,768 parameters and A has 8 × 4096 = 32,768 parameters. Total: 65,536. Compared to the 16,777,216 parameters of full ΔW, that's a 256× reduction. Across a whole 70B model, you go from 70 billion trainable parameters to a few tens of millions — small enough to email.

Scale this up to GPT-3. The attention projections inside each transformer block in GPT-3 are d = k = 12,288. Full ΔW for one of those matrices is about 151 million parameters. A rank-8 LoRA on the same matrix is 2 × 12,288 × 8 = 196,608, which is 768× smaller. Multiply by 96 layers and, adapting one such matrix per layer, you've moved the trainable parameter count of the entire model from 175 billion to roughly 19 million. The original LoRA paper reported finetuning GPT-3 with around 0.01% of the original parameter count and matching full-finetune quality.
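The counting above is easy to check by hand or in a few lines. This is a minimal sketch; the helper name `lora_param_count` is made up for illustration, and the GPT-3 total assumes exactly one adapted 12,288 × 12,288 matrix per layer.

```python
def lora_param_count(d: int, k: int, r: int) -> int:
    """Trainable parameters in B (d x r) plus A (r x k)."""
    return d * r + r * k


# The 4096 x 4096 example at rank 8.
small = lora_param_count(4096, 4096, 8)
small_ratio = (4096 * 4096) // small

# A GPT-3-sized square projection at rank 8.
gpt3_layer = lora_param_count(12288, 12288, 8)
gpt3_ratio = (12288 * 12288) // gpt3_layer

# Assumption: one adapted matrix in each of the 96 layers.
gpt3_total = 96 * gpt3_layer

print(small, small_ratio)            # 65536 256
print(gpt3_layer, gpt3_ratio)        # 196608 768
print(gpt3_total)                    # 18874368
```

For a square matrix the ratio simplifies to d / (2r), which is why it grows with the width of the model.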

There's a forward-pass version of the picture that helps. The original layer computes y = Wx. The LoRA-augmented layer computes y = Wx + BAx. Same input, same output shape, with one extra term. At training, you compute that term explicitly: Ax takes the k-dimensional input down to r dimensions, then B lifts it back up to d. Two skinny matrix multiplies sandwiching a tiny bottleneck. The bottleneck is the whole point — it's where the rank constraint lives.
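The forward pass can be sketched in NumPy. The sizes here are toy values picked for illustration, and the function name `lora_forward` is hypothetical. The point is that the bottleneck form never builds the d × k product BA, yet gives the same output as the dense version.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 6, 5, 2                 # toy sizes, for illustration only

W = rng.normal(size=(d, k))       # frozen pretrained weight
B = rng.normal(size=(d, r))       # trainable, d x r
A = rng.normal(size=(r, k))       # trainable, r x k
x = rng.normal(size=k)


def lora_forward(W, B, A, x):
    h = A @ x                     # project the k-dim input down to r dims
    return W @ x + B @ h          # lift back up to d dims and add to W x


y = lora_forward(W, B, A, x)
y_dense = (W + B @ A) @ x         # same result, but materialises the d x k delta
```

The bottleneck form costs r(d + k) multiplies for the extra term instead of d × k, which is what makes it cheap during training.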

A useful sanity check: with rank-1 LoRA, B is a column vector and A is a row vector, and BA is their outer product — a single rank-1 matrix. With rank-2 you get a sum of two outer products. The general picture is that LoRA adds a rank-r bump to W assembled out of r rank-1 directions that the optimiser gets to choose. The rank is the budget; the directions are what training spends it on.

The interactive: watch a delta collapse

Truncated SVD = optimal LoRA: useful task updates can live in a tiny subspace

Pick a structured 32×32 ΔW. We compute its SVD and reconstruct ΔW_r = BA using only the top r singular components, which is the Eckart–Young optimum: the best rank-r approximation in Frobenius norm. Slide r and watch reconstruction error fall off a cliff long before r = 32.

The gold bars are ΔW's singular values. The actual task update has only a few non-trivial directions, so a small r captures it almost exactly. At r = 4, BA uses 256 parameters (32 × 4 + 4 × 32) instead of 1024 and already recovers most of ΔW in Frobenius norm. This is why finetuning a frozen base with rank-8 adapters works: useful corrections are intrinsically low-rank.

The factorization isn't about hiding work. It's about admitting that useful task updates have low effective rank in the first place.
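The experiment is simple to reproduce offline. The sketch below builds its own structured ΔW (rank 3 plus a little noise, which is an assumption and not the matrix used in the interactive), truncates its SVD, and measures relative Frobenius error at a few ranks.

```python
import numpy as np

rng = np.random.default_rng(0)
n, true_rank = 32, 3

# Assumed stand-in for a structured task update: rank 3 plus small noise.
delta = rng.normal(size=(n, true_rank)) @ rng.normal(size=(true_rank, n))
delta += 0.01 * rng.normal(size=(n, n))

U, s, Vt = np.linalg.svd(delta, full_matrices=False)


def truncate(r):
    """Best rank-r factors in Frobenius norm (Eckart-Young)."""
    B = U[:, :r] * s[:r]          # n x r, singular values folded into B
    A = Vt[:r, :]                 # r x n
    return B, A


def rel_error(r):
    B, A = truncate(r)
    return np.linalg.norm(delta - B @ A) / np.linalg.norm(delta)


errors = {r: rel_error(r) for r in (1, 2, 4, 8, 32)}
for r, e in errors.items():
    print(f"r = {r:2d}  relative error = {e:.4f}")
```

The error at rank r equals the root of the discarded squared singular values divided by the root of all of them, so the curve can be read directly off the spectrum.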

Why does this even work?

If task adaptation genuinely required 16 million degrees of freedom per layer, LoRA would fail. The fact that it works tells us something interesting about the geometry of pretrained models.

Intrinsic dimension is small for finetuning

The intuition starts with intrinsic dimension. Li et al. (2018) showed that you can train a neural net by optimising in a randomly chosen low-dimensional subspace of weight space, and you'll often hit reasonable performance. They picked a random D-dimensional affine slice through the parameter space and ran gradient descent inside it. For many tasks, D = a few thousand gets you most of the way to full performance — even when the actual model has hundreds of millions of parameters. The implication: most directions in weight space don't matter. Useful learning lives in a much smaller manifold than the parameter count suggests.

Aghajanyan, Zettlemoyer, and Gupta sharpened this for finetuning specifically in their 2020 paper Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning. The headline finding: bigger pretrained models have lower intrinsic finetuning dimension. RoBERTa-large needs around 200 trainable directions to hit 90% of its full-finetune accuracy on MRPC. RoBERTa-base needs roughly 800. The act of pretraining doesn't just teach a model facts — it carves a manifold in weight space along which useful task-specific motion is cheap. This is the empirical floor that LoRA stands on.

Pretrained models amplify this. A model that has already been trained on a trillion tokens of internet text has, in some sense, learned almost everything it's going to learn. The remaining task adaptation is a small course correction. Finetuning the full network is like changing the rotation of a ship by repainting it; the LoRA observation is that you only need to nudge the rudder.

Real ΔW has low effective rank

More formally: the original LoRA paper probed the rank of the update empirically. It trained adapters at different ranks and with different random seeds, compared the subspaces they learned, and found that the top few singular directions are shared across runs while the remaining directions look mostly like noise. The update has a high-dimensional shape but low effective rank. So a low-rank parameterisation isn't an approximation that gives up performance; it's a parameterisation that matches the actual structure of the update.

Empirical low-rank structure: real task deltas concentrate in a few directions

Four candidate ΔW matrices for one layer. Three are synthesised with the structure of finetuning updates (rank-2 to rank-4 outer products, fast-decaying spectrum). The fourth is IID noise. Their singular-value curves tell the story.

legal voice ΔW: 90% energy at r = 3
medical voice ΔW: 90% energy at r = 2
code style ΔW: 90% energy at r = 2
iid noise (control): 90% energy at r = 15

The three task deltas drop to near-zero by component r ≈ 4: a tiny subspace carries almost all the action. The noise control is full rank and spreads its energy across all 28 components, with no compressible structure. Real finetuning updates look like the first three, not the fourth. That's the empirical bet that LoRA cashes in.
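The "90% energy at r" readout can be computed from singular values alone. This is a sketch with synthetic matrices of my own (a rank-3 product plus small noise, and an IID Gaussian control), not the matrices behind the figure; `rank_at_energy` is a hypothetical helper name.

```python
import numpy as np


def rank_at_energy(M, frac=0.9):
    """Smallest r whose top-r singular values hold `frac` of the squared-Frobenius energy."""
    s = np.linalg.svd(M, compute_uv=False)
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(energy, frac) + 1)


rng = np.random.default_rng(0)
n = 28

# Assumed task-like delta: three outer products plus a little noise.
task_like = rng.normal(size=(n, 3)) @ rng.normal(size=(3, n))
task_like += 0.01 * rng.normal(size=(n, n))

# Control: IID Gaussian noise with no low-rank structure.
noise = rng.normal(size=(n, n))

task_rank = rank_at_energy(task_like)
noise_rank = rank_at_energy(noise)
print(task_rank, noise_rank)
```

The structured matrix reaches 90% within its three planted directions, while the noise matrix needs a large share of its 28 components to get there.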

How far can you compress before things break?

There's a knee in the loss-vs-rank curve, and finding it is the practical question every LoRA user lives with. Too low and the adapter can't represent the task. Too high and you're back to expensive finetuning with a slightly different name.

The shape of the knee depends on the task. A simple style change — write more formally, hedge more — looks like the shift example: rank 2 or 3 is plenty. A domain shift — generic English to medical English — is closer to the blur example: maybe rank 8 to 16. A task that genuinely requires new latent capabilities — switching to a new language family, learning code in an architecture that hasn't seen code — runs into the rotate failure: LoRA at small rank simply can't get there, and you're better off with a higher rank or a different method.

Rank knee on a synthetic task: how fast does loss bottom out as rank grows?

Pick a task. We fit ΔW = BA at every rank from 1 to 24 using the Eckart–Young optimum (the solution an SGD fit would approach on this regression). The teal curve is task error; the gold curve is parameter count. The readout reports the knee (5% relative loss) and the parameter count at that rank against the 576 parameters of the full 24×24 ΔW. At r = 24 the two factors hold 1152 parameters, twice the full matrix, so the factorisation only saves parameters below r = 12.

Sweep rank against task loss for three synthetic tasks. For the rank-2 task (a cyclic shift is structurally tiny), the teal curve drops off a cliff long before r = 24 and plateaus far below full rank. Past the knee, you spend parameters without buying accuracy; that's the empirical signature of a low-intrinsic-rank task, and LoRA exploits exactly this gap. For the hard task (block rotations) the curve doesn't plateau at all. Where the curve goes flat is the rank you actually need.

Operationally: you sweep. Start at r = 8, evaluate on a held-out set, double if you're underfitting, halve if you're overfitting. The original LoRA paper found r = 4 sufficed for many GPT-3 finetuning tasks; modern open-source recipes default to r = 16 or r = 32 as a safe middle. There's an entire research thread on allocating rank adaptively per layer, with AdaLoRA as the best-known example, alongside variants such as IA³ and DoRA that change the parameterisation rather than the rank. They mostly work, but the simple constant-rank recipe is hard to beat for the time you spend tuning it.

Worth pausing on what rank means geometrically. A rank-r matrix can be written as a sum of r outer products: ΔW = σ₁ u₁ v₁ᵀ + σ₂ u₂ v₂ᵀ + ... + σ_r u_r v_rᵀ. Each outer product is a rank-1 'pattern' — a way to take some component of the input (selected by vᵢ) and write it back into the output (with shape uᵢ). A rank-r update is a small library of r such patterns. When you apply ΔW to an input x, you compute r dot products (one per vᵢ), scale each by σᵢ, and recombine using the uᵢ basis. Rank limits how many independent patterns the update can carry, not how strong each pattern is.
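The outer-product picture can be checked numerically. In this sketch (toy sizes, random factors, all assumed for illustration) the SVD of BA supplies the uᵢ, σᵢ, and vᵢ, and applying ΔW to x reduces to r dot products followed by a recombination.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 8, 6, 3                       # toy sizes

B = rng.normal(size=(d, r))
A = rng.normal(size=(r, k))
delta = B @ A                           # a rank-r update

U, s, Vt = np.linalg.svd(delta, full_matrices=False)
U, s, Vt = U[:, :r], s[:r], Vt[:r]      # keep the r non-trivial patterns

# The update as a sum of r rank-1 patterns sigma_i * u_i v_i^T.
delta_sum = sum(s[i] * np.outer(U[:, i], Vt[i]) for i in range(r))

x = rng.normal(size=k)
coeffs = Vt @ x                         # r dot products, one per v_i
y = U @ (s * coeffs)                    # scale by sigma_i, recombine in the u_i basis
```

Scaling a σᵢ up or down changes how strongly one pattern fires but never adds a new pattern, which is the sense in which rank bounds the number of patterns and not their strength.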

If finetuning a model amounts to teaching it a small set of new patterns — route this kind of token to that kind of attention head, map this style cue to that output bias — then rank-r is exactly the right abstraction for the size of that set. Aghajanyan's intrinsic-dimension result is essentially the empirical claim that r is small for finetuning. LoRA is the architectural trick that turns that claim into a parameter count.

Initialisation: why B = 0 and A is random

You freeze W. You initialise B to zero and A to small Gaussian random values. This asymmetric init looks weird at first glance — why not just initialise both to small random numbers? — and the answer is one of those small details that quietly does a lot of work.

At step zero, you want the model to behave exactly like the pretrained base. If you injected random noise into the weights from the start, you'd be punishing the model for nothing it did wrong, and the loss would jump up before training could pull it back. So you want BA = 0 at init. There are three ways to get there: zero both factors, zero A and randomise B, or zero B and randomise A. They do not all train.

Walking through the three init choices

If you zero both, every gradient vanishes. For a layer y = Wx + BAx with upstream gradient g = ∂L/∂y, the chain rule gives ∂L/∂B = g (Ax)ᵀ and ∂L/∂A = Bᵀ g xᵀ. With both factors at zero, both expressions are zero: you sit at a saddle point and gradient descent never escapes.

If you zero A and randomise B, the gradient on A is Bᵀ g xᵀ, which is non-zero, so A moves on the first step. The gradient on B is g (Ax)ᵀ, which is zero while A is zero, so B waits one step and starts moving as soon as A has left the origin.

Zeroing B and randomising A is the mirror image, and it is the convention the LoRA paper adopts. Ax is non-zero from the start, so B gets a full gradient on the first step; A's gradient is multiplied by Bᵀ and is zero at step zero, then switches on once B has moved. Either one-sided choice starts from BA = 0, keeps the model at the pretrained checkpoint, and avoids an early-training spike; what matters is that exactly one factor is zero. Small detail, real consequence.
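The first-step gradients for the three initialisations can be computed directly. This sketch assumes a single input, a squared-error loss, and toy sizes, none of which come from the paper; it uses the chain-rule expressions ∂L/∂B = g (Ax)ᵀ and ∂L/∂A = Bᵀ g xᵀ with g = ∂L/∂y.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 6, 5, 2                         # toy sizes

W = rng.normal(size=(d, k))               # frozen
x = rng.normal(size=k)
target = rng.normal(size=d)


def grads(B, A):
    """Gradients of 0.5 * ||W x + B A x - target||^2 with respect to B and A."""
    g = W @ x + B @ (A @ x) - target      # dL/dy
    grad_B = np.outer(g, A @ x)           # d x r
    grad_A = np.outer(B.T @ g, x)         # r x k
    return grad_B, grad_A


zero_B, zero_A = np.zeros((d, r)), np.zeros((r, k))
rand_B, rand_A = rng.normal(size=(d, r)), rng.normal(size=(r, k))

inits = {
    "both zero": (zero_B, zero_A),
    "A = 0, B random": (rand_B, zero_A),
    "B = 0, A random": (zero_B, rand_A),
}
norms = {
    name: tuple(float(np.linalg.norm(gr)) for gr in grads(B, A))
    for name, (B, A) in inits.items()
}
for name, (nb, na) in norms.items():
    print(f"{name:16s} |grad_B| = {nb:.3f}   |grad_A| = {na:.3f}")
```

With both factors at zero nothing moves. With exactly one factor at zero, the zeroed factor is the one that receives a gradient on the first step, and the other follows once the product is no longer zero.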

The asymmetric init also generalises. In any low-rank parameterisation where you want to start from an identity / zero perturbation, the recipe is: zero one factor, randomise the other, pick whichever choice gives you healthy gradients on both. The same logic shows up in residual connection design (zero-init the residual branch in some architectures) and in some MoE gating schemes. Once you see it, you see it everywhere.

How training works in practice

Forward pass: when the layer would normally compute Wx, it instead computes Wx + (α/r) BAx. Backward pass: gradients flow only into B and A, since W is frozen. The optimiser state — Adam's first and second moments — only needs to be maintained for B and A. Since they're small, training memory drops dramatically.
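A toy version of that loop, written out by hand in NumPy. Everything here is an assumption made for illustration: a single linear layer, a synthetic low-rank target, a mean squared-error loss, plain gradient descent in place of Adam, and α = 2r.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, n = 16, 16, 2, 64
alpha = 2 * r                              # assumed rule-of-thumb setting
scale = alpha / r

W = rng.normal(size=(d, k)) / np.sqrt(k)   # frozen base weight
X = rng.normal(size=(k, n))                # n training inputs as columns
true_delta = 0.1 * rng.normal(size=(d, r)) @ rng.normal(size=(r, k))
Y = (W + true_delta) @ X                   # targets from a low-rank shift of W

B = np.zeros((d, r))                       # zero init, so BA = 0 at step zero
A = rng.normal(size=(r, k)) / np.sqrt(k)
W_before = W.copy()

lr, losses = 0.02, []
for step in range(200):
    H = A @ X                              # r x n bottleneck activations
    err = W @ X + scale * (B @ H) - Y      # d x n residual
    losses.append(0.5 * np.mean(np.sum(err ** 2, axis=0)))
    G = err / n                            # dL/d(prediction)
    grad_B = scale * G @ H.T               # gradients exist only for B and A
    grad_A = scale * B.T @ G @ X.T
    B -= lr * grad_B
    A -= lr * grad_A

print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

W appears in the forward pass but is never written to, and the only state an optimiser would need to track has the shape of B and A.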

This is the second big practical win and people sometimes underweight it. LoRA isn't just storage-cheap, it's training-cheap. The optimiser-state savings are often bigger than the parameter savings on the GPU. A 70B model in fp16 takes 140 GB of weight memory; full finetuning with Adam needs roughly another 560 GB for the two fp32 moment buffers, and about 840 GB once you add fp32 master weights. With LoRA, you're storing Adam state for ~20 million parameters instead of 70 billion: a few hundred megabytes. The frozen base still has to sit in memory, so plain fp16 LoRA on a 70B model wants a couple of 80 GB GPUs rather than one; quantise the base to 4-bit (QLoRA, covered below) and it fits on a single 80 GB card. Without LoRA, that finetune requires multi-node distributed training. The accessibility shift is enormous.
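A back-of-the-envelope version of the memory budget. The byte counts are assumptions about a common mixed-precision setup (fp16 weights, two fp32 Adam moments, an optional fp32 master copy), and activations and gradients are left out entirely.

```python
GB, MB = 1e9, 1e6

n_base = 70_000_000_000        # 70B base parameters
n_lora = 20_000_000            # ~20M adapter parameters, the article's figure

fp16_weights_gb = n_base * 2 / GB            # frozen or not, the base must be resident
four_bit_weights_gb = n_base * 0.5 / GB      # same base quantised to 4 bits per weight

# Full finetuning: Adam keeps two fp32 moments per trainable parameter.
adam_moments_full_gb = n_base * 8 / GB
# Adding an fp32 master copy of the weights (assumed mixed-precision recipe).
adam_plus_master_full_gb = n_base * 12 / GB

# LoRA: the same bookkeeping, but only for the adapter parameters.
adam_moments_lora_mb = n_lora * 8 / MB
adam_plus_master_lora_mb = n_lora * 12 / MB

print(fp16_weights_gb, four_bit_weights_gb)                 # 140.0 35.0
print(adam_moments_full_gb, adam_plus_master_full_gb)       # 560.0 840.0
print(adam_moments_lora_mb, adam_plus_master_lora_mb)       # 160.0 240.0
```

The optimiser line shrinks by more than three orders of magnitude; the weight line does not shrink at all unless the base is quantised.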

Activations still cost what they cost: you're still running the full forward pass through W, and the backward pass still has to propagate through every frozen layer. So LoRA doesn't shrink activation memory the way gradient checkpointing does, and it doesn't cut per-step compute by much. What it makes cheaper is weight-gradient and optimiser memory, which is what was actually pinning finetuning to expensive hardware. Some speed comes for free as a second-order effect: smaller optimiser state means less data movement on the GPU, and fewer gradients to communicate in distributed settings.

The merge-at-inference trick

Here's the part that makes LoRA practical at deployment scale. At inference time, you don't have to keep B and A as separate matrices. You can compute ΔW = BA once (times the α/r scale, if you trained with one), add it to W, and store the merged result. Now the model has the same architecture and the same compute cost as the original: no extra latency, no extra parameters, nothing for an inference engine to special-case.

Compare this to other adapter methods. Approaches like Houlsby adapters insert new modules into the model, so inference has to run them at every forward pass — extra layers, extra latency, kernel boundaries the compiler can't fuse through. Prompt tuning prepends learned tokens to the input, which lengthens every sequence and complicates batching. LoRA's matrix-multiply structure means the adapter dissolves into the existing weights at deploy time. Zero inference overhead. This is a big part of why LoRA won and the alternative adapters mostly didn't.

Same forward pass, zero adapter overhead

Step through training, the merge step, and deployment. At training, you keep W frozen and learn B and A as separate skinny matrices. At deployment, you compute ΔW = BA once and add it: W′ = W + BA. The deployed picture is a single merged W′ with the original architecture, and FLOPs per token at inference are identical to the original model.

The flip side: if you want to swap adapters (serve fifty users, each with their own LoRA, dynamically), you don't merge. You keep W shared and store the small (B, A) per user, applying them at runtime. The overhead is small (an extra pair of rank-r matrix multiplies per adapted layer, which can often be fused with the main matmul), and the storage win is huge: each user's LoRA is megabytes, not hundreds of gigabytes. Serving frameworks like vLLM and TGI now ship multi-LoRA serving as a first-class feature: one base model in GPU memory, hundreds of user-specific adapters streamed in and out per request.
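The unmerged serving pattern is small enough to sketch. This is a toy, single-layer illustration in NumPy, not how vLLM or TGI implement it; the registry `adapters`, the user names, and the function `serve` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 8, 8, 2                              # toy sizes

W = rng.normal(size=(d, k))                    # one shared base weight

# Hypothetical per-user registry: each entry is a small (B, A) pair.
adapters = {
    user: (0.1 * rng.normal(size=(d, r)), rng.normal(size=(r, k)))
    for user in ("alice", "bob")
}


def serve(user, x):
    """Apply the shared base, plus the caller's adapter if one is registered."""
    base = W @ x
    if user not in adapters:
        return base                            # unknown users get the plain base model
    B, A = adapters[user]
    return base + B @ (A @ x)                  # two skinny matmuls on top of W x


x = rng.normal(size=k)
y_alice, y_bob, y_guest = serve("alice", x), serve("bob", x), serve("guest", x)
```

Per-user storage here is r(d + k) numbers against d × k for the base, which is the whole argument for keeping adapters unmerged when there are many of them.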

The merge trick is also reversible. If you stored (B, A), you can always go back to the original W by subtracting. So you can stack and unstack adapters at deploy time — apply a domain-specific LoRA, then a style-specific LoRA on top, then peel one off when the user changes context. The math is just addition and subtraction in weight space.

A worked example helps make the merge math concrete. Suppose W is a pretrained 4096×4096 matrix, and you've trained B (4096×8) and A (8×4096). To merge: compute BA (one matrix multiply, 4096 × 8 × 4096 ≈ 134M multiply-adds, well under a millisecond on a modern GPU), then add it to W in place. The result is a 4096×4096 matrix indistinguishable in shape from the original. You can save it to disk, load it into any inference framework, and run as if you'd trained the whole network. The adapter has stopped being a separate object and become part of the weights.
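The merge and its inverse, sketched in NumPy. To keep it quick to run, the matrices here are 256 × 256 with random values standing in for pretrained and trained weights; both choices are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 256, 256, 8                     # smaller than 4096 to keep the demo light

W = rng.normal(size=(d, k))               # stand-in for a pretrained weight
B = 0.01 * rng.normal(size=(d, r))        # stand-ins for trained adapter factors
A = rng.normal(size=(r, k))
x = rng.normal(size=k)

# Merge: one skinny matmul and one addition.
W_merged = W + B @ A

y_adapter = W @ x + B @ (A @ x)           # training-time view: base plus adapter
y_merged = W_merged @ x                   # deployment view: a single dense matmul

# Unmerge: subtract the same product to recover the base (up to float rounding).
W_restored = W_merged - B @ A
```

The recovery is exact only up to floating-point rounding, which matters more in fp16 than in the float64 used here, so keeping an untouched copy of W is the safer way to unstack adapters repeatedly.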

Where to apply it

Not every layer benefits equally. The original paper found that applying LoRA to the attention Q and V projections gave the best return on parameter count. The MLP layers can also benefit but require more rank for the same effect. The intuition: attention is where information routing happens, and routing decisions are where most task-specific behaviour gets encoded.

Hu et al. ran the ablation under a fixed parameter budget, trading rank against coverage: rank 8 on a single projection type, rank 4 on two, rank 2 on all four. Q + V came out at or near the top. K alone was the worst: keys are about what to attend to, and pretraining seems to do that job well enough that the task delta lives mostly in what gets routed (Q) and what gets read (V). It's a small empirical fact that hints at where information lives in attention.

What modern recipes actually do

Modern practice has converged on something pragmatic: apply LoRA to all linear layers — Q, K, V, output projection, both MLP projections — use rank 8 to 64 depending on how different the target task is from pretraining, and let the optimiser figure out how to allocate. The hyperparameter that matters most is α / r — the scaling factor on the LoRA contribution. In practice, you tune it like a learning rate.

There's also a question of which layers by depth to adapt. Empirically, adapting all of them works best, but the marginal value tends to be highest in the middle layers — early layers handle generic linguistic features, late layers are close to the output and don't need much tweaking, the middle is where task-specific reasoning lives. If you have a tight parameter budget and want to be surgical, target the middle third.

There's also a question of which adapter format you ship. The community converged on PEFT (Hugging Face's parameter-efficient finetuning library) as a de facto standard, with a small JSON config plus the (B, A) tensors stored as a .safetensors file. A typical rank-16 LoRA for a 7B model fits in roughly 15–80 MB in fp16, depending on which layers you targeted: about 17 MB for Q and V only, about 80 MB for every linear layer. That's the size of a handful of high-resolution photographs. People email these around. People version-control them. The unit of finetuning has shrunk to something a single engineer can hold in their head and a single git repo can track.

QLoRA, and the moment LoRA ate the world

The 2023 follow-up that made LoRA the default: QLoRA (Dettmers et al.). Quantize the frozen base model to 4-bit, keep the LoRA adapters in 16-bit, and finetune. The base model now takes a quarter of the VRAM. The adapters are still small. You can finetune a 65B model on a single 48 GB GPU.

Quantisation and low-rank adaptation play unusually well together, which isn't obvious in advance. Quantising weights normally creates a problem for finetuning: gradients want to make small adjustments, but quantised weights snap to a discrete grid. You either have to dequantise during the backward pass (expensive) or accept that you can't update quantised weights smoothly (not a finetune). LoRA sidesteps this entirely. The quantised weights never get updated — the adapter is what's learning. So you can quantise the base aggressively without breaking training. The two techniques are made for each other.

The trick that makes QLoRA work isn't just naive 4-bit quantisation. Dettmers and collaborators introduced NF4 (NormalFloat 4-bit) — a quantisation scheme adapted to the empirical distribution of pretrained weights, which look approximately Gaussian. They added double quantisation (quantising the quantisation constants themselves) and paged optimisers (offloading optimiser state to CPU memory in chunks). The combined effect: a 65B model that would have needed multiple A100s to finetune now fits on a single consumer-ish GPU.

The base model lives in 4-bit; the LoRA adapters compute in 16-bit; gradients flow through the dequantised base weights into the adapters. The base never gets updated, so quantisation error doesn't compound during training: it's a fixed perturbation that the adapter learns to work around. This is the configuration that turned LoRA from a research technique into a default tool. Open-source finetuning communities (every Llama variant, every Mistral derivative) run on QLoRA or something close to it. If you've trained a model on consumer hardware in the last two years, you almost certainly used some form of LoRA.
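A small sketch of why a quantised, frozen base and a full-precision adapter coexist cleanly. Note the substitution: this uses plain blockwise absmax quantisation with 15 uniform levels, not NF4, and stores codes in int8 without packing or double quantisation. The function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, block = 8, 64, 2, 16                      # toy sizes; d * k divisible by block

W = rng.normal(size=(d, k))                        # stand-in for a pretrained weight


def quantize_absmax4(W, block):
    """Blockwise absmax quantisation to integer codes in [-7, 7] (uniform, not NF4)."""
    blocks = W.reshape(-1, block)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0
    codes = np.clip(np.round(blocks / scales), -7, 7).astype(np.int8)
    return codes, scales


def dequantize(codes, scales, shape):
    return (codes * scales).reshape(shape)


codes, scales = quantize_absmax4(W, block)         # done once; the base is then frozen
W_deq = dequantize(codes, scales, W.shape)
quant_error = W - W_deq                            # fixed for the whole of training

B = np.zeros((d, r))                               # adapter kept in full precision
A = rng.normal(size=(r, k))
x = rng.normal(size=k)
y = W_deq @ x + B @ (A @ x)                        # gradients would reach only B and A
```

Because the codes are never rewritten, the rounding error is a constant offset on W. Training B and A can compensate for it, and nothing ever has to take a gradient step on the discrete grid.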

Inside one transformer block

Walk through a single transformer block to see where LoRA touches it. The block has roughly six linear projections. Attention contributes four: W_Q, W_K, W_V, and the output projection W_O. The MLP contributes two: an up-projection (often W_up or W_gate + W_up in gated variants like SwiGLU) and a down-projection W_down. Each is a matrix you could hang a LoRA off.

If you LoRA all of them at rank 16, each adapted matrix costs r × (d_in + d_out) parameters. For Llama-2 7B with d = 4096 and MLP hidden dim 11008, the four attention projections cost 4 × 16 × (4096 + 4096) = 524,288, and the three SwiGLU MLP projections (gate, up, down) cost 3 × 16 × (4096 + 11008) = 724,992, for about 1.25M parameters per block. Times 32 blocks = ~40M trainable parameters. The full model is 7B. So even with full-coverage LoRA, you've shrunk trainable parameters by ~170×. That's the rough budget you're working with in practice.
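The per-block budget can be tallied explicitly. This sketch counts r × (d_in + d_out) per adapted matrix and assumes a Llama-2-7B-shaped block with three MLP projections (gate, up, down); the projection names follow the common Hugging Face naming and are used here only as labels.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """A is r x d_in and B is d_out x r, so the pair costs r * (d_in + d_out)."""
    return r * (d_in + d_out)


hidden, mlp_hidden, n_blocks, r = 4096, 11008, 32, 16

# (d_in, d_out) for every linear projection in one block.
projections = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, hidden),
    "v_proj": (hidden, hidden),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, mlp_hidden),
    "up_proj": (hidden, mlp_hidden),
    "down_proj": (mlp_hidden, hidden),
}

per_block = sum(lora_params(d_in, d_out, r) for d_in, d_out in projections.values())
total = per_block * n_blocks
qv_only = n_blocks * sum(lora_params(*projections[p], r) for p in ("q_proj", "v_proj"))

print(per_block)   # 1249280
print(total)       # 39976960
print(qv_only)     # 8388608
```

Targeting only Q and V cuts the adapter to about a fifth of the full-coverage size, which is the trade-off behind the choice of target layers.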

Adapter ecosystems and the routing question

Once adapters are cheap and storage-tiny, you can train a lot of them. The Hugging Face Hub hosts tens of thousands of LoRAs. Stable Diffusion's user community trains LoRAs for individual artistic styles, individual characters, individual lighting conditions: a single base model with a long-tail library of small specialists.

Which raises the question: at inference, which LoRA do you apply? The simplest answer is the user picks. But you can also let the system decide. LoraHub (Huang et al. 2023) takes a library of frozen LoRAs and, given a handful of examples from a new task, uses gradient-free optimisation to learn the weights of a combination of them. AdapterFusion learns attention weights over a fixed adapter library. Mixture-of-LoRAs in serving systems treats individual adapters as experts and routes per token, the same way MoE routes through experts inside a single model.

Ecosystem of adapters: one base, many personalities

Three LoRA adapters specialised for legal, medical, and code voice. Pick a primary adapter, optionally blend it with a second, and watch the model's behaviour shift. A second panel makes the storage argument concrete at 8 tasks:

full finetune, 8 × 70B copies: 1120 GB
LoRA, 1 base + 8 adapters (18 MB each): 140.1 GB

That is 8.0× less storage, and each new task costs 18 MB, not 140 GB. Full finetuning scales linearly with task count; LoRA stays roughly flat.

The base model is the expensive thing. Once it's pretrained, every personality is a small additive correction. The deployment unit shifts from "a model" to "a base + a library of adapters." Mixture-of-LoRAs and adapter routing build on this primitive: store hundreds of skills, gate which one fires per query.

Routing changes what kind of object you're shipping. A model becomes a base, plus a registry of adapters, plus a routing rule. Each piece is independently versioned, independently auditable, independently deployable. The base might update once a year; adapters update weekly; routing rules update hourly. This decoupling looks a lot like how production software stacks evolved — kernel, libraries, applications — and there's a reasonable bet that ML systems are heading the same way.

The composability is the part that surprises people. Two LoRAs trained for different tasks can often be added — just sum the (B, A) contributions in weight space — and the model exhibits a believable mixture of both behaviours. This works because LoRA updates are small linear perturbations of a common base; superposition holds approximately. It breaks when adapters disagree about how to route attention, but for non-conflicting skills (a domain LoRA + a style LoRA), it works often enough to be a deployment pattern.
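What "adding two LoRAs" means in weight space can be made precise. In this sketch (toy sizes, random factors, assumed for illustration) the combined update is the sum of the two products, which is the same as one adapter whose factors are concatenated. Adding the B matrices and the A matrices separately is a different operation, because it introduces cross terms.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 8, 6, 2                            # toy sizes

B1, A1 = rng.normal(size=(d, r)), rng.normal(size=(r, k))   # e.g. a domain adapter
B2, A2 = rng.normal(size=(d, r)), rng.normal(size=(r, k))   # e.g. a style adapter

# Composition in weight space: sum the two low-rank updates.
delta_sum = B1 @ A1 + B2 @ A2

# Equivalent single adapter of rank 2r: stack B column-wise and A row-wise.
B_cat = np.concatenate([B1, B2], axis=1)     # d x 2r
A_cat = np.concatenate([A1, A2], axis=0)     # 2r x k
delta_cat = B_cat @ A_cat

# Not equivalent: summing the factors first adds cross terms B1 A2 + B2 A1.
delta_naive = (B1 + B2) @ (A1 + A2)
```

The concatenated form also shows the cost of stacking: the combined adapter has rank up to 2r, so a library of composed adapters grows in rank with every skill you add.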

Where it goes from here

DoRA and the magnitude/direction split

DoRA (Liu et al. 2024) noticed that during full finetuning, weight updates have a particular structure: the direction of each weight column changes substantially, but the magnitude changes less. LoRA's rank-r update couples magnitude and direction, which is sometimes wasteful. DoRA decomposes each pretrained weight column into a magnitude scalar and a direction vector, applies LoRA only to the direction, and learns the magnitude separately. Small modification, consistent improvement on most benchmarks.
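The magnitude/direction split can be sketched for a single weight matrix. This is a simplified reading of the idea: normalise each column of W₀ + BA and rescale it by a learned magnitude. The sizes, the random values, and the helper names are assumptions, and details of the published method such as gradient handling are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 8, 6, 2                                  # toy sizes

W0 = rng.normal(size=(d, k))                       # frozen pretrained weight


def col_norms(M):
    """Euclidean norm of each column, kept as a 1 x k row for broadcasting."""
    return np.linalg.norm(M, axis=0, keepdims=True)


m = col_norms(W0)                                  # trainable magnitudes, initialised from W0
B = np.zeros((d, r))                               # LoRA factors act on the direction only
A = rng.normal(size=(r, k))


def dora_weight(W0, m, B, A):
    V = W0 + B @ A                                 # direction before normalisation
    return m * V / col_norms(V)                    # unit-norm columns, rescaled by m


W_init = dora_weight(W0, m, B, A)                  # equals W0 while B is zero

B_trained = 0.1 * rng.normal(size=(d, r))          # pretend some training happened
W_new = dora_weight(W0, m, B_trained, A)
```

After the low-rank update, each column of the effective weight still has norm m, so the adapter can only turn columns and the magnitude vector is the only thing that can stretch them.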

On-device personalisation

A LoRA is small enough to ship to a phone. You can imagine a future where the base model is a static binary that ships with the OS, and your personal adapter — trained on your messages, your writing style, your local data — lives in your private storage and is loaded at inference. The whole privacy story for personalised AI hinges on something like LoRA: the heavy intelligence is shared and frozen, the personalisation is small and local.

Apple's on-device foundation model architecture (announced 2024) uses exactly this pattern. A frozen base model on the device, swappable adapter LoRAs for tasks like email summarisation, message rewriting, calendar understanding. The OS picks the right adapter per request. None of it is novel research; all of it is LoRA hitting the production scaling ramp.

Continual learning without catastrophic forgetting

Because the base model is frozen, adapter-based finetuning can't break the original capabilities. You can stack adapters for new tasks without losing old ones — a structural fix for one of the oldest problems in neural networks. Sequential LoRA trains a fresh adapter per task and keeps a library; LoRA averaging blends adapters at inference; experiments by Wang et al. and others show this preserves old-task performance much better than full sequential finetuning.

Scientific finetuning

Apply LoRA to a frozen protein language model to specialise it on a specific enzyme family, or to a frozen genome model for one organism. The base model captures general biology; the adapter captures the niche. You couldn't afford to retrain the foundation model per family; LoRA makes it free. ESM, AlphaFold, Evo — all of them get this kind of treatment now, with task-specific LoRAs published alongside papers the way trained heads used to be.

A short tour of the failure modes

It helps to know what LoRA can't do, because the failure modes are diagnostic. The first one is the rank wall. If you try to teach a model a new language with rank 4, you'll see training loss decrease, validation loss plateau early, and outputs that produce vaguely-shaped tokens in the new script but with no coherent grammar. The model is using its four directions per layer to fake the shape of the task without actually learning it. Bumping rank helps until it doesn't; past a certain point you need to either unfreeze the embedding layer or do continued pretraining.

The second is forgetting in disguise. LoRA can't damage the base, but it can override useful pretrained behaviour during inference. Train a LoRA hard on a narrow domain and the merged model becomes that domain — ask it general-knowledge questions and it'll answer in the voice of your domain even when that's wrong. The base hasn't forgotten anything; the adapter is just shouting over it. Mitigation: keep the LoRA strength (α/r) modest, mix domain data with general data during training, or serve unmerged so you can dial the adapter down at inference.

The third is rank thrash. If you set rank too high relative to the data, the adapter starts memorising — overfitting in the classic sense. Loss on training drops, validation diverges, and merged outputs become weirdly specific to training examples. Halving rank usually fixes it. The general pattern: rank acts as a regulariser. Smaller rank = stronger prior that the update is structured = better generalisation up to the point where you can't fit at all.

Limits

LoRA can't add capability that isn't already latent in the base. If the base model has never seen Python, no LoRA will teach it Python — you'd need a much higher-rank update, possibly a full continued pretrain. LoRA is for adaptation, not acquisition. The intuition: a low-rank update can rotate and reweight existing features, but it can't manufacture new ones. New features are full-rank objects.

And the choice of rank is a real tradeoff. Too low and you can't fit the task. Too high and you're back to expensive finetuning with weaker generalisation. There's no closed-form way to choose; you sweep and you guess. Adaptive-rank methods such as AdaLoRA help, but the simple LoRA recipe at rank 8–32 remains the default that's hard to beat.

There are also stability gotchas. LoRA training can be sensitive to the α/r ratio in ways that surprise people coming from full finetuning. Setting it too high causes the adapter to dominate the base early in training and you lose pretrained knowledge; setting it too low and the adapter takes forever to learn anything. The rule of thumb α = 2r is a reasonable starting point, but plan to tune it.

LoRA can't fix a wrong base

And there's a subtler limit that doesn't show up in benchmarks but bites in production: LoRA assumes the base model is correct. If the pretrained weights have a bias you'd like to fix — a refusal pattern that's too aggressive, a knowledge cutoff that's too early, a style that's too verbose — LoRA can mask the symptom but not the cause. Your adapter learns to route around the base's behaviour, which works locally and breaks under distribution shift. For deep behavioural change, you eventually need to change the base.

One last note about why this paper matters more than its technical content. LoRA was published in mid-2021. The current open-model ecosystem — every Llama derivative, every Mistral finetune, every Stable Diffusion character LoRA, the entire bring-your-own-finetune economy — runs on top of it. Without a cheap way to specialise a frozen base, the open-weights world looks completely different. There's a small set of papers that quietly enabled an entire phase of the field, and this is one of them.

Read the paper for the empirical sections — the ablations on which layers to adapt and what ranks work are still the most useful operational guidance you'll find. And then go finetune something. The whole point of LoRA is that it's now actually accessible. A laptop, an open base model, a few thousand examples, and you have a personalised model in an afternoon. That gap, between interesting research idea and something a single engineer can do at their kitchen table, is what LoRA closed.
