Machine Learning / 2021 / arXiv
Learning Transferable Visual Models From Natural Language Supervision
ImageNet had 1000 categories. The web had all of them. CLIP is what happens when you stop asking humans to label images and start using the captions that were already there.
Computer vision before 2021 had a labeling problem. To train a model to recognize dogs, you collected a few thousand pictures of dogs and a few thousand pictures of not-dogs and paid people to confirm which was which. To recognize 1000 things, you built ImageNet, a million labeled images across a thousand classes that took years to assemble and ran a Stanford lab for the better part of a decade. To recognize 22,000 things, you built ImageNet-22k. To recognize what a chest X-ray was showing, you built a separate dataset, hired separate radiologists, trained a separate classifier, and shipped it as a separate model.
The problem with this picture isn't that it doesn't work. It works fine. ResNet-50 trained on ImageNet hit 76% top-1 in 2015 and was a workhorse for the next half decade. The problem is that it doesn't scale. Every new concept you want the model to know costs human time. Every new domain you want it to handle costs a new dataset. The unit of progress is one labeled example, and labeled examples are expensive in a way that compounds: you pay annotators, you pay reviewers to check the annotators, you pay engineers to build the labeling tooling, and at the end you have a fixed taxonomy that goes stale the moment a new visual concept enters the world.
There's a deeper issue too. The taxonomy itself is a bottleneck. ImageNet's 1000 classes are an artifact of WordNet, a hand-built English ontology from the 1980s. The classes include 120 dog breeds and zero classes for meme, whiteboard, X-ray, PCB, crochet pattern, or any of the thousands of visual concepts that exist on the internet but didn't exist in the head of a 1980s lexicographer. A classifier trained on ImageNet has, by construction, a worldview from before the internet.
CLIP looked at the same problem and noticed something obvious in retrospect: the internet is full of images that already come with text describing them. Alt text, captions, surrounding paragraphs, filenames, social-media posts, product listings. Nobody labeled those images for any particular task. But collectively, they describe billions of visual concepts in the vocabulary the world actually uses. If you could train a model on (image, text) pairs scraped from the web, you wouldn't need to choose what categories to teach it. You'd get whatever vocabulary the internet uses, in whatever distribution the internet uses it.
OpenAI scraped 400 million such pairs, called the dataset WIT (WebImageText), and trained a model on it with a contrastive objective. The result classifies images into thousands of categories it was never explicitly trained on, including categories nobody chose in advance. It transfers to 30+ benchmarks without fine-tuning. It matches ResNet-50's ImageNet accuracy without ever seeing an ImageNet label. The trick that made this work, contrastive learning between image and text encoders, is what we want to walk through.
What natural-language supervision unlocks
Before getting into the math, it's worth being precise about what changes when you replace class labels with captions. The shift sounds small. It is not.
Density of supervision. A class label carries very little information per image: at most log2(1000) ≈ 10 bits for a 1000-way label. A caption like a Dalmatian puppy chasing a frisbee on the beach at sunset carries far more: species, breed, age, action, object, location, time of day. Some of those bits are noisy or wrong. On average, free-form text is a vastly denser signal about what's in the image than a single class label. CLIP gets to learn from the density.
Open-vocabulary semantics. ImageNet's classifier produces a probability distribution over a fixed set of 1000 strings. Adding a class means retraining. CLIP produces an embedding, a 512-dimensional vector in a space that any English string can also be projected into. Adding a class means typing the string. The classifier becomes the language.
Free data and distributional fit
Free data scaling. The bottleneck on ImageNet was "how many things can humans agree on labels for." The bottleneck on CLIP is "how many image-text pairs can you scrape." The first scales linearly with paid annotator hours. The second scales with the size of the internet, which is, as a practical matter, free. WIT at 400M pairs is roughly 300× the size of ImageNet's 1.28 million training images, and the cost was one engineering team for a few months instead of a Mechanical Turk budget for several years.
Distributional alignment. Maybe the most underrated benefit. The training distribution is the distribution of images people share with text on the internet, which is much closer to the distribution of images you'll actually want to classify than the staged, centered, well-lit photos in ImageNet. CLIP is, almost by accident, a model of how humans actually use images. ImageNet is a model of how a 2010 dataset team thought humans should use images.
The contrastive trick
CLIP has two networks. An image encoder f turns an image into a vector. A text encoder g turns a string into a vector. Both vectors live in the same d-dimensional space (in the original paper, d = 512 for the ViT-B models; other variants use different widths). The encoders are trained jointly, from scratch, to do one thing: matching (image, text) pairs should land at nearby points in this shared space, and non-matching pairs should land far apart. That sentence is the entire idea. Everything else is implementation detail.
Here's the actual training step. Sample a batch of N image-caption pairs from WIT. Encode all N images with f. Encode all N captions with g. Normalize all 2N vectors to unit length so cosine similarity reduces to a dot product. You now have two stacks of unit vectors. Compute every image's similarity against every caption: that's an N × N matrix, where entry (i, j) is the cosine similarity between image i and caption j.
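The forward half of that step can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the random arrays stand in for the outputs of f and g, and the names `image_features`, `text_features`, and `l2_normalize` are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 4, 8  # toy batch size and embedding width; real training uses far larger values

# Stand-ins for f(images) and g(captions): random vectors, NOT real encoder outputs.
image_features = rng.normal(size=(N, d))
text_features = rng.normal(size=(N, d))

def l2_normalize(x):
    """Scale each row to unit length so a dot product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

I = l2_normalize(image_features)
T = l2_normalize(text_features)

# S[i, j] is the cosine similarity between image i and caption j.
S = I @ T.T
print(S.shape)
```

Normalizing first is what lets a single matrix multiply produce all N × N cosine similarities at once.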
Diagonal up, everything else down
Now look at this matrix. The diagonal entries are similarities between matching pairs: image i with its own caption. The off-diagonal entries are mismatched pairs: image i with caption j where j ≠ i. The training objective wants the diagonal to be large and everything off the diagonal to be small. Treat each row of the matrix as the logits of an N-way classification problem where the correct answer is the diagonal. Apply softmax cross-entropy. Do the same for each column (because matching is symmetric). Average the two losses. That is the InfoNCE loss, and it is the entire training objective.
Notice what is not in this objective: a separate object detector, per-task labels, a class hierarchy, hard-negative mining, a curriculum, a scheduled-sampling trick. There is no auxiliary head. There is no language-model loss. There is no decoder. Just: pull matches together, push everything else apart, in a batch.
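As a rough sketch of the objective just described, the function below applies softmax cross-entropy along rows and columns of a similarity matrix with the diagonal as the target, then averages the two. The function names and the two toy matrices are illustrative assumptions; the temperature value 0.07 is only a plausible default.

```python
import numpy as np

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(S, tau=0.07):
    """Average of row-wise and column-wise cross-entropy with diagonal targets."""
    logits = S / tau
    diag = np.arange(S.shape[0])
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()  # each image picks its caption
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()  # each caption picks its image
    return (loss_i2t + loss_t2i) / 2

perfect = np.eye(4)               # toy case: matches at similarity 1, mismatches at 0
uninformative = np.zeros((4, 4))  # toy case: every pair equally similar

print(symmetric_contrastive_loss(perfect))
print(symmetric_contrastive_loss(uninformative))
```

The second toy case lands at log N, the loss of a uniform guess over N candidates, which is a handy sanity check when wiring this up against real encoders.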
Figure (interactive in the original): cosine similarity matrix. Same space, two encoders, one matrix. Rows are images and columns are captions; each cell shows a cosine similarity, and a temperature control shows the InfoNCE softmax sharpening onto the diagonal.
The matrix is the entire artifact. Diagonal entries — matching pairs — dominate, but the off-diagonals carry structure too: 🐕 and 🚗 share “outdoor”, 🔬 and 🍕 share “round close-up”. As τ shrinks, the row softmax snaps onto the diagonal — that’s InfoNCE pulling matches together and pushing everything else apart.
InfoNCE, written out
Let I be the matrix of image embeddings (rows are unit vectors), T the matrix of text embeddings, S = I · Tᵀ the N × N similarity matrix. The image-to-text loss is the softmax cross-entropy of each row against its diagonal target:
Lᵢ→ₜ = −(1/N) · Σᵢ log[ exp(Sᵢᵢ / τ) / Σⱼ exp(Sᵢⱼ / τ) ]
The text-to-image loss is the same with rows and columns swapped. The total loss is L = (Lᵢ→ₜ + Lₜ→ᵢ) / 2. There is one learnable temperature scalar τ, initialized at 0.07 and clipped during training so that the logits are never scaled by more than 100 (that is, τ cannot fall below 0.01). That's the whole math.
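For readers who prefer both directions spelled out, here is the same objective in LaTeX, using the symbols already defined (S, N, τ). The only difference between the two terms is which index the softmax denominator sums over.

```latex
\begin{aligned}
\mathcal{L}_{i \to t} &= -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(S_{ii}/\tau)}{\sum_{j=1}^{N} \exp(S_{ij}/\tau)} \\
\mathcal{L}_{t \to i} &= -\frac{1}{N} \sum_{j=1}^{N} \log \frac{\exp(S_{jj}/\tau)}{\sum_{i=1}^{N} \exp(S_{ij}/\tau)} \\
\mathcal{L} &= \tfrac{1}{2}\left(\mathcal{L}_{i \to t} + \mathcal{L}_{t \to i}\right)
\end{aligned}
```

The first line normalizes over the captions in row i; the second normalizes over the images in column j.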
Two knobs control the difficulty of the loss surface: batch size N and temperature τ. Larger N gives more negatives per positive, which makes the discrimination problem harder and the gradient richer. Smaller τ sharpens the softmax: at τ → 0 the softmax becomes argmax, and the loss measures whether the matching pair is strictly the largest entry; at τ → ∞ the softmax becomes uniform and the loss carries no signal.
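To get a feel for the two knobs, the sketch below evaluates the image-to-text loss on a deliberately simple, hypothetical similarity matrix in which every matching pair scores 0.5 and every mismatched pair scores 0.1. Those two values and the helper names are assumptions chosen for illustration, not measurements from a trained model.

```python
import numpy as np

def image_to_text_loss(S, tau):
    """Row-wise softmax cross-entropy against the diagonal."""
    logits = S / tau
    logits = logits - logits.max(axis=1, keepdims=True)  # stabilise the exponentials
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

def toy_similarities(n, match=0.5, mismatch=0.1):
    """Hypothetical matrix: matching pairs at `match`, all other pairs at `mismatch`."""
    return np.full((n, n), mismatch) + (match - mismatch) * np.eye(n)

for n in (8, 64, 512):
    losses = [image_to_text_loss(toy_similarities(n), tau) for tau in (0.01, 0.07, 1.0, 100.0)]
    print(n, np.round(losses, 3), "log N =", round(float(np.log(n)), 3))
```

With a fixed gap between matches and mismatches, the loss approaches zero as τ shrinks and approaches log N as τ grows, and at any given τ a larger batch yields a larger loss.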
Figure (interactive in the original): InfoNCE landscape. Bigger batch, harder task; lower τ, sharper softmax. Left panel: a synthetic similarity matrix at the chosen batch size. Right panel: the InfoNCE loss as a function of temperature τ, with one curve for each of three batch sizes.
Lower τ and the softmax sharpens, the matching pair dominates, and the loss drops. Raise N and the random-guess ceiling rises (log N), and the curve shifts up because there are more negatives to push apart. CLIP wanted both knobs: a small τ (initialized at 0.07) and a gigantic N (32,768 in the paper).
Why contrastive instead of generative?
Earlier vision-language work (VirTex, ICMLM) tried predictive pretraining: given an image, predict the words of its caption. This is a stronger signal per pair but vastly more expensive per gradient step, because every token position is its own classification problem over a vocabulary of tens of thousands of tokens. The CLIP authors compared the options at matched compute: predicting a bag-of-words encoding of the caption learned about 3× more efficiently than predicting the exact caption with a transformer language model, and swapping the predictive objective for the contrastive one gave a further 4×, roughly 12× overall. The reason is intuitive: generating the exact caption is much harder than recognizing which caption out of a batch matches, and the signal you actually need for representation learning is the latter.
This is a recurring pattern in self-supervised learning. The objective doesn't have to be the thing you ultimately want; it has to be a thing whose gradient teaches the encoder to organize information well. "Match the right pair out of N" turns out to teach a vision tower remarkably well, even though you'll never use that capability directly at inference time.
The shape of the embedding space
The output of training is a single shared geometry. Every image and every string gets a point on the same unit sphere in 512-dimensional space (everything is unit-normalized, so the embeddings occupy the surface of the sphere rather than all of ℝ⁵¹²). The structure of that sphere is the model's understanding of the world.
Two images of dogs sit near each other because the captions that paired with them used overlapping vocabulary. Photo of a dog and photo of a puppy sit near each other because they paired with overlapping images. By transitivity, photo of a puppy sits near photo of a dog, near actual photos of dogs, near a golden retriever lying on grass, near photos of golden retrievers. The whole web's worth of caption co-occurrence statistics is compressed into the geometry.
What you get is something stronger than a classifier and weaker than a knowledge graph. It's a continuous map of visual concepts, where distance approximates semantic relatedness in roughly the way humans use it. That continuity is what makes CLIP useful for things the original authors didn't anticipate.
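Because every embedding is unit-normalized, "distance" and "cosine similarity" are two views of the same quantity. A short derivation, with u and v as illustrative names for any two embeddings and θ the angle between them:

```latex
\|u - v\|^2 = \|u\|^2 + \|v\|^2 - 2\,u \cdot v = 2 - 2\cos\theta_{uv},
\qquad \text{given } \|u\| = \|v\| = 1
```

So ranking neighbors by highest cosine similarity gives the same order as ranking by smallest Euclidean distance, which is why the two are used interchangeably when talking about this space.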
Figure (interactive in the original): shared embedding space (PCA). Where the matching pairs end up living. Twelve images and twelve captions are projected by PCA into the plane and shown as training progresses from 0% to 100%.
At 0% the encoders are random — image and caption positions have no relationship. At 100% matching pairs sit on top of each other and semantic neighbors cluster (animals on one side, vehicles on another). The shared space is what makes zero-shot work.
Zero-shot classification, mechanically
Once you have a shared image-text embedding space, classification becomes a question. Not "what class is this?" but "which of these sentences is closest to this image?" Here's the procedure to use CLIP as a classifier on a dataset it has never seen, with no fine-tuning.
- Pick the candidate classes. For ImageNet, that's 1000 strings: tench, goldfish, great white shark, etc.
- Wrap each class name in a prompt template: a photo of a {class}. So a photo of a tench, a photo of a goldfish, ...
- Run all 1000 prompts through the text encoder. You get 1000 unit vectors. These are your class embeddings.
- Run the test image through the image encoder. You get one unit vector.
- Compute cosine similarity between the image vector and each class vector. The argmax is your prediction.
That's it. The classifier is just sentence embeddings of the class names. There's no classification head being trained, no logistic regression fit, no fine-tuning, no labeled examples. The model is using the same machinery it used during pretraining — match an image to one of N candidate texts — and the candidate texts happen to be class names instead of scraped captions.
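The five steps above can be sketched as follows. Real use would obtain `class_vecs` and `image_vec` from a pretrained text and image encoder; here they are random stand-ins so the example runs on its own, and `zero_shot_classify` is a made-up helper name.

```python
import numpy as np

def zero_shot_classify(image_vec, class_vecs, class_names):
    """Return the class whose text embedding has the highest cosine similarity."""
    image_vec = image_vec / np.linalg.norm(image_vec)
    class_vecs = class_vecs / np.linalg.norm(class_vecs, axis=1, keepdims=True)
    sims = class_vecs @ image_vec  # one cosine similarity per candidate class
    return class_names[int(np.argmax(sims))], sims

class_names = ["tench", "goldfish", "great white shark"]
prompts = [f"a photo of a {name}" for name in class_names]

# Stand-in embeddings (assumption): in practice class_vecs = text_encoder(prompts)
# and image_vec = image_encoder(image).
rng = np.random.default_rng(0)
class_vecs = rng.normal(size=(len(prompts), 16))
image_vec = class_vecs[1] + 0.1 * rng.normal(size=16)  # an "image" lying near the goldfish prompt

prediction, sims = zero_shot_classify(image_vec, class_vecs, class_names)
print(prompts)
print(prediction)
```

Note that the class embeddings only need to be computed once per label set; classifying each new image is then a single matrix-vector product.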
The headline result was that this procedure, with no training on ImageNet, hit 76.2% top-1 accuracy on ImageNet, matching the original ResNet-50 from 2015 without ever seeing a single ImageNet label. On 16 of 27 datasets the CLIP team tested, zero-shot CLIP beat a fully supervised linear-probe baseline trained on ResNet-50 features. On some datasets (Stanford Cars, Food101) the gap was 20+ points.
The robustness story is even better than the accuracy story. When you take a ResNet trained on ImageNet and evaluate it on ImageNet-V2 (a re-collected test set following the original protocol), accuracy drops by ~12 points. When you evaluate it on ImageNet-Sketch or ImageNet-R or ImageNet-A (distribution shifts), it drops by roughly 40 points or more. Zero-shot CLIP at matched ImageNet accuracy degrades far less on those same shifts: in the paper's comparison it loses about 6 points on ImageNet-V2 and about 16 on ImageNet-Sketch, and it scores higher on ImageNet-R and ImageNet-A than on ImageNet itself. CLIP didn't memorize ImageNet, so it doesn't break when ImageNet's exact distribution disappears.
Prompt engineering matters more than you'd think
Here's a fun thing about CLIP that prefigures the next decade of LLM weirdness: how you phrase the class names matters a lot. tench alone gets one accuracy. a photo of a tench gets a noticeably higher accuracy. a photo of a tench, a type of fish gets higher still. a centered close-up photo of a tench might do better on some test sets and worse on others. On ImageNet alone, prompt-template choice swings accuracy by roughly 5 points.
Why? Because the text encoder was trained on web captions, and "tench" alone almost never appears in web captions. A bare noun is a query, not a description. "A photo of a tench" is much closer to the kind of string that paired with images during training. You're matching the distribution of training captions, not the literal class name. The closer your prompt looks to a real caption, the better the embedding lands in a region of the space populated by relevant images.
Ensembling 80 templates per class
The CLIP paper actually does prompt ensembling: run each class through 80 different templates (a photo of a {class}, a blurry photo of a {class}, a sculpture of a {class}, a low resolution photo of the {class}, a cropped photo of a {class}, a tattoo of a {class}, a {class} in a video game, etc.), average the resulting class vectors, and use that average as the class embedding. This squeezes out a few extra points of accuracy and is the kind of grungy detail that academic papers usually hide. The original CLIP repo has the full list of 80 templates, and reading them is genuinely funny.
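A minimal sketch of that averaging step, under stated assumptions: the per-template vectors are synthetic (a shared class direction plus noise) instead of real text-encoder outputs, only three templates are used, and normalizing before and after the mean is a choice made here so that cosine similarity stays a plain dot product.

```python
import numpy as np

def ensemble_class_embedding(template_vecs):
    """Average unit-normalised per-template embeddings for one class, then renormalise."""
    unit = template_vecs / np.linalg.norm(template_vecs, axis=1, keepdims=True)
    mean = unit.mean(axis=0)
    return mean / np.linalg.norm(mean)

templates = ["a photo of a {class}", "a blurry photo of a {class}", "a sculpture of a {class}"]
prompts = [t.replace("{class}", "tench") for t in templates]

# Stand-in for text_encoder(prompts) (assumption): a shared class direction plus template noise.
rng = np.random.default_rng(0)
class_direction = rng.normal(size=16)
class_direction /= np.linalg.norm(class_direction)
template_vecs = class_direction + 0.2 * rng.normal(size=(len(prompts), 16))

class_vec = ensemble_class_embedding(template_vecs)

unit_templates = template_vecs / np.linalg.norm(template_vecs, axis=1, keepdims=True)
print("mean single-template similarity:", float((unit_templates @ class_direction).mean()))
print("ensemble similarity:", float(class_vec @ class_direction))
```

In this toy setup the averaged vector sits at least as close to the underlying class direction as the templates do on average, which is the intuition for why ensembling helps: template-specific quirks partly cancel.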
There's also dataset-specific prompt customization. For satellite imagery: a satellite photo of a {class}. For sketches: a pencil sketch of a {class}. For food: a photo of a {class}, a type of food. Each adds a few points. The intuition is the same: nudge the text vector toward the part of the embedding sphere where the test images live.
Figure (interactive in the original): prompt template shifts. Same image, same class names, different prompt. Six images, six class names, and six prompt templates, showing how strongly each image matches each class under each template and how the overall accuracy moves.
Going from bare to a photo of a {class} flips images that were getting tripped up. The ensemble row is what the CLIP authors actually shipped — average the text embeddings across 80 templates and use that as the class vector.
Compositional retrieval and embedding arithmetic
Once you've internalized that CLIP text vectors live in a continuous space, the next thought is: can I do arithmetic on them? word2vec famously had king − man + woman ≈ queen. Does CLIP's space support similar moves?
Approximately yes. photo of a dog + sunglasses − no accessory gives you a vector that retrieves images of dogs in sunglasses. photo of a cat + as an oil painting gives you a vector near images that look like cat paintings. The arithmetic is rough — CLIP's space isn't strictly linear, attribute axes aren't orthogonal, and adding too many deltas blows up the magnitude in ways that hurt retrieval. But the directional intuition holds well enough to be useful for image editing, controlled generation, and retrieval interfaces that go beyond raw text matching.
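Here is a toy sketch of that kind of composition: build a query as base + α·add − α·sub, renormalize it, and rank a gallery by cosine similarity. The four hand-made axes (dog, cat, sunglasses, no accessory) are purely hypothetical; real CLIP attribute directions are not orthogonal like this. The renormalization is one simple way to keep the magnitude from blowing up.

```python
import numpy as np

def unit(v):
    """Project a vector (or each row of a matrix) onto the unit sphere."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def compose_query(base, add, sub, alpha=1.0):
    """base + alpha*add - alpha*sub, renormalised back onto the unit sphere."""
    return unit(base + alpha * add - alpha * sub)

def rank_gallery(query, gallery):
    """Gallery indices sorted from most to least similar to the query."""
    return np.argsort(-(gallery @ query))

# Hypothetical axes: [dog, cat, sunglasses, no accessory]
base = np.array([1.0, 0.0, 0.0, 0.0])  # "photo of a dog"
add = np.array([0.0, 0.0, 1.0, 0.0])   # "sunglasses"
sub = np.array([0.0, 0.0, 0.0, 1.0])   # "no accessory"

gallery_names = ["plain dog", "dog in sunglasses", "cat in sunglasses"]
gallery = unit(np.array([
    [1.0, 0.0, 0.0, 1.0],
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
]))

order = rank_gallery(compose_query(base, add, sub, alpha=1.0), gallery)
print([gallery_names[k] for k in order])
```

With α = 0 the query is just the base prompt; raising α moves it toward gallery items carrying the added attribute and away from those carrying the subtracted one.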
Figure (interactive in the original): embedding arithmetic. Composing prompts in CLIP space. A base prompt, an attribute to add, and an attribute to subtract form the query vector base + α·add − α·sub; a gallery of images is ranked by cosine similarity to that query.
With α = 0, the query is just the base prompt. As α grows, the attribute vectors push and pull the query toward images that carry those attributes. This works because text embeddings move roughly linearly in CLIP space — not perfectly, but enough to do useful retrieval.
What CLIP imports along with the data
Web-scale data is not a clean substrate. Every bias in how the internet labels images comes along for the ride. The CLIP paper is unusually honest about this; Section 6 ("Limitations") and Section 7 ("Broader Impacts") are genuinely worth reading directly, and the analysis is more careful than the typical 2021 vision paper.
Demographic bias. When you give CLIP an image of a person and ask it to classify into occupations, the predictions correlate with race and gender in the way you'd predict from internet captions. Pictures of women are more often tagged with homemaker; pictures of men with executive. When the candidate classes include non-human categories like animal or criminal, the model produces concerning misclassifications at higher rates for some demographic groups than others. The model is a fairly faithful mirror of what the internet says about who looks like what, and the internet is not a fair mirror.
Caption-quality dependence. CLIP works best on concepts that are well-described in web captions. Dog breeds, common objects, art styles, celebrities, brands, tourist landmarks — great. Specialized medical imagery, scientific diagrams, satellite photography, surveillance footage, schematics — much worse, because the relevant terms are rare in web text and the matching images are uncommon. The training distribution shapes the capability map in ways that aren't visible until you probe them. You can spend a week probing CLIP on niche domains and discover its competence map looks like a moth-eaten quilt.
Shortcut features and typographic attacks
Spurious correlations. Because the loss only requires the matching pair to be closer than the mismatching pairs in the batch, the model can solve the task with whatever shortcut is available. If captions about cows mostly come with grass backgrounds, cow and grass end up nearby in the embedding space, and a picture of a cow on a beach might get classified as a beach. The model learned the easiest distinguishing features, and "easiest" is set by the dataset, not by what we'd want. OpenAI's follow-up work on multimodal neurons (Goh et al., 2021) demonstrates typographic attacks: you can make CLIP misclassify an apple as an iPod by taping a piece of paper that says "iPod" onto the apple. The model has learned that text-in-image is a remarkably strong signal, because text-in-image is a remarkably strong signal in web data.
Fixed worldview. CLIP's embedding space is frozen at the moment of training. Concepts that emerged after the WIT scrape are, by construction, not in the model's vocabulary. If you ask a 2021-vintage CLIP model to identify a NeRF rendering or a Flux generation, it will pick the closest 2020 concept it has and call that. This is not a bug specific to CLIP — every static model has it — but it's especially salient for a model whose appeal is open-vocabulary classification.
Where CLIP went after CLIP
CLIP's bigger contribution wasn't the zero-shot ImageNet number. It was the shared embedding space, which turned out to be useful for things nobody had quite imagined when the paper came out. Five years later, the techniques CLIP introduced are load-bearing infrastructure for a chunk of modern AI.
Multimodal LLMs use CLIP-like vision encoders. When GPT-4 or Claude or Gemini look at an image you've uploaded, the image goes through a vision tower whose lineage traces back to CLIP. The pretraining objective has shifted (modern vision encoders use richer captions, mixed objectives, sometimes generative auxiliary losses, and aren't always purely contrastive), and the connection layer between vision and language has gotten more sophisticated (Q-formers, projections, cross-attention). But the architectural move, encoding the image into a space the language model can consume as tokens, is the CLIP move. Without CLIP demonstrating that contrastive pretraining produced a transferable visual representation, the multimodal LLM stack would look quite different, and probably worse.
Image generation conditions on CLIP text embeddings. Stable Diffusion, DALL·E 2, the early versions of Imagen: when you typed a prompt, the prompt went through a text encoder (sometimes literally CLIP's, sometimes a sibling like OpenCLIP, and in Imagen's case a text-only T5) to produce an embedding, and the diffusion model was conditioned on that embedding, typically through cross-attention. Generation used the alignment in reverse: instead of asking which caption is closest to this image, it asked what image would land near this caption. The fact that CLIP's space had compositional structure (sunglasses + dog ≈ dog wearing sunglasses) is a large part of what lets CLIP-conditioned diffusion models follow compound prompts.
CLIP-guided generation. Before classifier-free guidance settled into the dominant pattern, the open-source generative-art community spent 2021 building VQGAN+CLIP and similar pipelines that directly optimized images to maximize CLIP similarity to a prompt. The whole DeepDream-grandchild aesthetic of the early text-to-image era — surrealist, fractal, recursive — was downstream of CLIP's cosine similarity being usable as a loss function on pixels.
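The core mechanic, treating cosine similarity to a prompt embedding as an objective and climbing its gradient with respect to the pixels, can be shown on a toy problem. Everything here is a stand-in: the "image encoder" is a random linear map `W`, the "prompt embedding" is a random unit vector, and the gradient is written out by hand. The real pipelines backpropagated through CLIP and a generator network.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))          # hypothetical linear "image encoder": 16 pixels -> 8-d embedding
text_vec = rng.normal(size=8)
text_vec /= np.linalg.norm(text_vec)  # stand-in for the prompt embedding

def similarity_and_grad(pixels):
    """Cosine similarity between W @ pixels and text_vec, plus its gradient w.r.t. pixels."""
    y = W @ pixels
    norm = np.linalg.norm(y)
    u = y / norm
    sim = float(u @ text_vec)
    grad_y = (text_vec - sim * u) / norm  # derivative of cosine similarity w.r.t. y
    return sim, W.T @ grad_y              # chain rule through the linear encoder

pixels = rng.normal(size=16)              # start from noise
start_sim, _ = similarity_and_grad(pixels)
for _ in range(500):
    _, grad = similarity_and_grad(pixels)
    pixels = pixels + 0.1 * grad          # gradient ascent on similarity
final_sim, _ = similarity_and_grad(pixels)
print(start_sim, final_sim)
```

The step size and iteration count are arbitrary choices for the toy; the point is only that similarity to the prompt is differentiable with respect to the image, so it can be optimized directly.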
The recipe generalizes past natural images
Scientific embedding models. The same recipe (pair domain-specific images with domain-specific text, train contrastively) has been applied to chest X-rays (CheXzero, MedCLIP, BiomedCLIP), pathology slides (PLIP, trained on the OpenPath dataset), satellite imagery (RemoteCLIP), molecular structures, fashion (Marqo-FashionSigLIP), audio (CLAP), code, and recipes. In each case the appeal is the same: stop choosing categories in advance, let the literature define them, and get a useful zero-shot tool out the other side. The recipe also generalizes across modalities: ImageBind extends the contrastive trick to six modalities (image, text, audio, depth, thermal, IMU) by pairing each one with images and letting the shared image space mediate.
Retrieval and dataset curation. Because CLIP gives you cheap, dense image embeddings, it's become the standard tool for content-based image retrieval, deduplication, and dataset filtering. LAION-5B, the open dataset that powered Stable Diffusion, was filtered by CLIP similarity. Most large-scale image datasets shipped after 2021 use CLIP somewhere in the pipeline. The tool ate its own substrate.
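Similarity-based filtering of this kind reduces to one score per (image, caption) pair and a cutoff. The sketch below uses tiny hand-written vectors in place of real embeddings, and the threshold of 0.3 is a hypothetical value picked for the example, not the setting any particular dataset used.

```python
import numpy as np

def filter_pairs(image_vecs, text_vecs, threshold):
    """Keep pairs whose image and caption embeddings have cosine similarity >= threshold."""
    I = image_vecs / np.linalg.norm(image_vecs, axis=1, keepdims=True)
    T = text_vecs / np.linalg.norm(text_vecs, axis=1, keepdims=True)
    sims = np.sum(I * T, axis=1)  # row-wise dot product: one score per pair
    return np.flatnonzero(sims >= threshold), sims

# Toy embeddings (assumption): pair 1 has a caption unrelated to its image.
image_vecs = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
text_vecs = np.array([[1.0, 0.1], [0.0, 1.0], [0.2, 1.0]])

keep, sims = filter_pairs(image_vecs, text_vecs, threshold=0.3)
print(keep, np.round(sims, 3))
```

Unlike the training loss, this only needs the diagonal of the similarity matrix, so it scales linearly with the number of scraped pairs.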
What's still hard
CLIP is shockingly good at what is in this image and not great at where is it, how many are there, what is the spatial relationship. The contrastive loss rewards matching the gist of the caption. "A red ball to the left of a blue cube" and "a red ball to the right of a blue cube" produce nearly identical CLIP scores against the same image, because both captions match the gist. Compositionality, spatial relations, counting, attribute binding (the red one is square, the blue one is round) were all known weak spots in 2021 and remain weak spots in CLIP-style models today, even though they've been partially patched in modern multimodal systems through richer captions and additional grounding objectives.
CLIP is also, in a strict sense, a representation-learning result, not a generation result. It tells you whether an image and a caption match. It doesn't write captions or draw images on its own. That's why the more recent multimodal systems pair CLIP-like encoders with autoregressive language models or with diffusion models. CLIP gives you the embedding; something else does the generation. A generation of "vision-language models" that try to do everything in one stack (Flamingo, Qwen-VL, etc.) have absorbed CLIP's pretraining ideas but generally still keep a CLIP-derived vision tower at the bottom.
And then there's the data-rights question, which CLIP largely sidestepped in 2021 and which the field has not resolved. WIT was scraped without consent of the photographers, artists, or subjects. So was LAION. So was every successor dataset. The legal status of training on web-scale image-text pairs is being litigated in real time, and whatever the answer is, it will reshape how the next CLIP gets built. "Just scrape the web" was free in 2020 and it might not be free in 2027.
If you read the original
The paper is long (48 pages with appendices) and reads more like a technical report than a typical conference paper. Read Section 2 for the method (it's short and clear, with a 14-line pseudocode block that is the entire training loop). Read Section 3.1 for the zero-shot results, especially the prompt-engineering ablation in 3.1.4 and the per-dataset breakdowns in 3.1.5. Skim Section 3.2 (representation learning evaluation) unless you care about linear-probe numbers. Read Section 6 (limitations) and Section 7 (broader impacts) directly; they're better than most ML ethics sections from that era.
If you want to feel CLIP work, install open_clip, load a pretrained checkpoint, and run zero-shot classification on an image of your choice with class names you make up on the spot. The fact that it correctly distinguishes a photo of my cat from a photo of someone else's cat is unsurprising. The fact that it correctly classifies an X-ray of a fractured tibia without ever being told what an X-ray is: that's the moment the trick lands. And once it lands, the rest of the multimodal stack stops looking magical and starts looking like an obvious consequence of having a shared embedding space and enough internet to fill it.