Vision Transformer (ViT) from Scratch

Re-deriving 'An Image is Worth 16×16 Words' to understand inductive bias, global context, and attention in vision.

2025-10-18 · 25 min read
Computer Vision · Transformers · Attention · Vision Transformers · Paper Reproduction
01

Why Question Convolutions?

Convolutional Neural Networks succeed because of strong inductive biases: locality and translation equivariance. These assumptions work well for vision, but they also limit flexibility. Vision Transformers challenge this design by removing most of that built-in spatial bias: beyond the initial split into patches, spatial structure has to be learned from data.

My goal was not to beat CNN benchmarks, but to understand what vision models learn when we stop hard-coding assumptions and let attention discover structure from data.

IMAGE
Abstract representation of image patches as tokens

CNN vs Vision Transformer image transformation comparison. (Source: OpenAI)

02

From Pixels to Tokens

A Vision Transformer begins by converting an image into a sequence of fixed-size patches. Instead of explicitly slicing patches with loops, I used a mathematically equivalent and more efficient approach: a Conv2D layer with kernel size and stride equal to the patch size. This single operation extracts patches and projects them into the embedding space using highly optimized GPU kernels.

IMAGE
Case study visual

Patch extraction and projection in a Vision Transformer. (Source: original paper)
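
To make the equivalence concrete, here is a minimal NumPy sketch rather than the project's actual code. It assumes a toy 32×32 RGB input, 8×8 patches, and a 64-dimensional embedding; the function names `patchify`, `patch_embed`, and `strided_conv_reference` are made up for illustration. It compares "reshape into patches, then apply one linear projection" against a naive convolution whose kernel size and stride both equal the patch size.

```python
import numpy as np

def patchify(images, patch):
    """Split (B, C, H, W) images into flattened patches of shape (B, N, C*patch*patch)."""
    b, c, h, w = images.shape
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch size"
    gh, gw = h // patch, w // patch
    x = images.reshape(b, c, gh, patch, gw, patch)
    x = x.transpose(0, 2, 4, 1, 3, 5)              # (B, gh, gw, C, p, p)
    return x.reshape(b, gh * gw, c * patch * patch)

def patch_embed(images, weight, bias, patch):
    """Project patches with a conv-shaped weight (D, C, p, p) used as a linear map."""
    d = weight.shape[0]
    tokens = patchify(images, patch)               # (B, N, C*p*p)
    return tokens @ weight.reshape(d, -1).T + bias # (B, N, D)

def strided_conv_reference(images, weight, bias, patch):
    """Naive convolution with kernel size = stride = patch, used only as a check."""
    b, c, h, w = images.shape
    d = weight.shape[0]
    gh, gw = h // patch, w // patch
    out = np.zeros((b, gh * gw, d))
    for i in range(gh):
        for j in range(gw):
            window = images[:, :, i * patch:(i + 1) * patch, j * patch:(j + 1) * patch]
            out[:, i * gw + j] = np.einsum("bchw,dchw->bd", window, weight) + bias
    return out

# Toy sizes chosen for illustration only.
rng = np.random.default_rng(0)
images = rng.normal(size=(2, 3, 32, 32))
weight = rng.normal(size=(64, 3, 8, 8))
bias = rng.normal(size=64)

tokens = patch_embed(images, weight, bias, patch=8)
reference = strided_conv_reference(images, weight, bias, patch=8)
print(tokens.shape, np.allclose(tokens, reference))
```

A 32×32 image with 8×8 patches yields a 4×4 grid, so each image becomes 16 tokens. In a deep learning framework the same projection would typically be a single strided convolution layer, which avoids the Python loops used in the reference check here.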

03

Restoring Spatial Information

Unlike CNNs, Transformers have no built-in notion of spatial order: without extra information, self-attention treats the patches as an unordered set. To compensate, I added a learnable positional embedding to each patch token, one vector per position. These embeddings encode absolute position, from which the model can learn spatial relationships between patches while keeping the flexibility of attention-based processing.

LATEX
X = X_{patch} + E_{pos}

Learnable positional embeddings inject spatial awareness into the token sequence.
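
As a rough illustration of the sum above, the sketch below adds a positional table to a batch of patch tokens in NumPy. The sizes (16 tokens, 64 dimensions), the 0.02 initialization scale, and the helper name `add_positional_embeddings` are assumptions made for the example. In a real model `e_pos` would be a trainable parameter, whereas here it is just a fixed random array.

```python
import numpy as np

def add_positional_embeddings(x_patch, e_pos):
    """x_patch: (B, N, D) patch tokens; e_pos: (1, N, D) table shared across the batch."""
    assert x_patch.shape[1:] == e_pos.shape[1:], "one embedding per token position"
    return x_patch + e_pos  # broadcasting adds the same table to every image

rng = np.random.default_rng(0)
num_tokens, dim = 16, 64                                # illustrative sizes
e_pos = 0.02 * rng.normal(size=(1, num_tokens, dim))    # assumed init scale

# Every patch has identical content, so the tokens are indistinguishable at first.
x_patch = np.ones((2, num_tokens, dim))
x = add_positional_embeddings(x_patch, e_pos)

identical_before = np.allclose(x_patch[0, 0], x_patch[0, 5])
identical_after = np.allclose(x[0, 0], x[0, 5])
print(identical_before, identical_after)
```

After the addition, two patches with the same pixel content but different positions map to different vectors, which is the information attention needs in order to tell them apart.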

04

Global Context from the First Layer

Self-attention allows every patch to interact with every other patch in a single layer. Unlike CNNs, which expand their receptive field gradually with depth, ViTs can model global relationships from the first layer. The model can therefore directly relate distant regions, such as far-apart edges, textures, or object parts (for example, the eyes and nose of a face), without relying on depth alone.

LATEX
\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V

Scaled dot-product attention enables global information flow.
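
The formula can be written out in a few lines of NumPy. This is a single-head sketch under assumed toy dimensions (16 tokens, model width 64, head width 32), with random projection matrices standing in for learned weights. It is meant to show the shape of the computation, not a full multi-head layer.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    """q, k, v: (B, N, d_k). Returns the attended values and the (B, N, N) weights."""
    d_k = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d_k)  # similarity of every token pair
    weights = softmax(scores, axis=-1)                  # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 16, 64))                        # (batch, tokens, model width)
# Random stand-ins for the learned query, key, and value projections.
w_q, w_k, w_v = (rng.normal(size=(64, 32)) / 8.0 for _ in range(3))

out, weights = scaled_dot_product_attention(x @ w_q, x @ w_k, x @ w_v)
print(out.shape, weights.shape)
```

The `weights` array has one row per query token and one column per key token, so every patch receives a nonzero weight from every other patch in a single layer. That full N×N matrix is the "global context" described above, and it is also why the cost grows quadratically with the number of patches.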

05

Learning to Classify with a [CLS] Token

For image-level classification, I adopted the [CLS] token strategy from NLP Transformers. A learnable classification token is prepended to the patch sequence and participates in self-attention. Over training, it learns to aggregate information from all patches into a single representation used by the final MLP head.

IMAGE
CLS token interaction diagram

The [CLS] token gathers global image information through attention.
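
The mechanics of the [CLS] token are simple enough to sketch in NumPy. The example below only shows the plumbing: prepend one shared vector to every sequence, then read out position 0. The encoder is omitted entirely, a single linear layer stands in for the MLP head, and the names `prepend_cls` and `classify_from_cls` as well as the sizes (16 patches, 64 dimensions, 10 classes) are assumptions for illustration.

```python
import numpy as np

def prepend_cls(tokens, cls_token):
    """tokens: (B, N, D); cls_token: (1, 1, D). Returns a (B, N + 1, D) sequence."""
    b, _, d = tokens.shape
    cls = np.broadcast_to(cls_token, (b, 1, d))   # same learnable vector for every image
    return np.concatenate([cls, tokens], axis=1)

def classify_from_cls(encoded, w_head, b_head):
    """Read out the token at index 0 and apply a linear head (stand-in for the MLP head)."""
    return encoded[:, 0] @ w_head + b_head

rng = np.random.default_rng(0)
tokens = rng.normal(size=(2, 16, 64))             # patch tokens
cls_token = np.zeros((1, 1, 64))                  # would be a trainable parameter
sequence = prepend_cls(tokens, cls_token)

# A real model would pass `sequence` through the Transformer encoder here.
encoded = sequence

w_head = rng.normal(size=(64, 10)) * 0.02
b_head = np.zeros(10)
logits = classify_from_cls(encoded, w_head, b_head)
print(sequence.shape, logits.shape)
```

Because the classification token is part of the sequence, the positional table has to cover N + 1 positions rather than N once it is added.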

06

Training Dynamics and Stability

Training a Vision Transformer from scratch proved to be highly sensitive to optimization choices. Stable convergence required careful weight initialization, learning-rate warmup, and consistent normalization.

Unlike CNNs, ViTs lack strong built-in inductive biases, which makes training behavior far more dependent on data quality and optimization strategy.

Due to limited compute and dataset size, experiments were run on a reduced-scale setup.

This constraint made the trade-offs explicit and reinforced a key insight from the original paper: Vision Transformers rely heavily on scale and regularization to generalize well, which is why large-scale pretraining is standard practice in production settings.

IMAGE
Training curves visualization

Optimization choices play a larger role in Transformer-based vision models.
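
Learning-rate warmup is easy to express as a function of the training step. The sketch below uses linear warmup followed by cosine decay, which is one common choice and an assumption here, since the exact schedule used in this project is not specified. The function name `lr_at_step` and the numbers in the usage lines are illustrative.

```python
import math

def lr_at_step(step, base_lr, warmup_steps, total_steps):
    """Linear warmup to base_lr, then cosine decay to zero (an assumed schedule)."""
    if step < warmup_steps:
        # Ramp up from base_lr / warmup_steps to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup phase that has elapsed, clipped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))

# Illustrative values: 10 warmup steps out of 100 total.
schedule = [lr_at_step(s, base_lr=1e-3, warmup_steps=10, total_steps=100) for s in range(101)]
print(schedule[0], schedule[9], schedule[100])
```

Starting with a small learning rate gives the randomly initialized attention layers time to settle before large updates are applied, which is the usual motivation for warmup in Transformer training.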

07

What This Experiment Revealed

While my implementation was not designed to beat state-of-the-art CNNs, it confirmed key insights from the original paper: Vision Transformers trade inductive bias for flexibility. When data and compute are sufficient, attention-based models can match and sometimes surpass convolutional approaches by learning structure directly from data.

IMAGE
Model comparison illustration

Understanding behavior mattered more than benchmark scores.

08

Key Takeaways

Building a Vision Transformer from scratch clarified how architectural assumptions shape learning behavior. CNNs encode strong priors; Transformers defer structure to data. This project strengthened my intuition around attention, optimization, and why modern vision systems increasingly combine both paradigms in hybrid architectures.

IMAGE
Engineering workflow visual

From first principles to modern vision systems.