Vision Transformer (ViT) from Scratch

Re-deriving 'An Image is Worth 16×16 Words' to understand inductive bias, global context, and attention in vision.

2025-10-18 · 25 min read
Computer Vision · Transformers · PyTorch · Research Implementation
01

Why Question Convolutions?

Convolutional Neural Networks succeed because of strong inductive biases: locality and translation equivariance. These assumptions work well for vision—but they also limit flexibility. Vision Transformers challenge this design by removing spatial bias entirely. My goal was not to beat CNN benchmarks, but to understand what vision models learn when we stop hard-coding assumptions and let attention discover structure from data.

IMAGE
Abstract representation of image patches as tokens

Vision Transformers treat images as sequences instead of grids.

02

From Pixels to Tokens

A Vision Transformer begins by converting an image into a sequence of fixed-size patches. Instead of explicitly slicing patches with loops, I used a mathematically equivalent and more efficient approach: a Conv2D layer with kernel size and stride equal to the patch size. This single operation extracts patches and projects them into the embedding space using highly optimized GPU kernels.

PYTHON
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Splits an image into non-overlapping patches and projects each to embed_dim."""
    def __init__(self, in_channels, patch_size, embed_dim):
        super().__init__()
        # A Conv2d with kernel_size == stride == patch_size is equivalent to
        # slicing the image into patches and applying a shared linear projection.
        self.proj = nn.Conv2d(
            in_channels,
            embed_dim,
            kernel_size=patch_size,
            stride=patch_size,
        )

    def forward(self, x):
        x = self.proj(x)                    # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)    # (B, N, D), with N = (H/P) * (W/P)
        return x

Patch extraction and projection using a single Conv2D layer.
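
As a quick sanity check of the shapes (the 224×224 resolution and 16-pixel patches below mirror the ViT-Base defaults and are illustrative, not required by the module):

PYTHON
# Illustrative shape check: a 224x224 RGB image with 16x16 patches
# yields (224 / 16)^2 = 196 tokens, each of dimension embed_dim.
patcher = PatchEmbedding(in_channels=3, patch_size=16, embed_dim=768)
images = torch.randn(2, 3, 224, 224)    # (B, C, H, W)
tokens = patcher(images)
print(tokens.shape)                     # torch.Size([2, 196, 768])

Each image becomes a sequence of 196 tokens that the Transformer treats exactly like words in a sentence.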

03

Restoring Spatial Information

Unlike CNNs, Transformers have no built-in notion of spatial order. To compensate, I added learnable positional embeddings to each patch token. These embeddings allow the model to reason about relative spatial relationships while preserving the flexibility of attention-based processing.

LATEX
X = X_{\text{patch}} + E_{\text{pos}}

Learnable positional embeddings inject spatial awareness into the token sequence.
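
A minimal sketch of that addition in PyTorch; the `num_tokens` argument and the truncated-normal initialization are assumptions of mine, following common ViT implementations rather than details stated above:

PYTHON
import torch
import torch.nn as nn

class PositionalEmbedding(nn.Module):
    """Adds a learnable position vector to each token (illustrative sketch)."""
    def __init__(self, num_tokens, embed_dim):
        super().__init__()
        # One trainable embedding per position, learned end to end.
        self.pos_embed = nn.Parameter(torch.zeros(1, num_tokens, embed_dim))
        nn.init.trunc_normal_(self.pos_embed, std=0.02)

    def forward(self, x):          # x: (B, N, D)
        return x + self.pos_embed  # broadcasts over the batch dimension

Because the embeddings are free parameters, the model decides for itself how to encode spatial layout instead of inheriting it from the architecture.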

04

Global Context from the First Layer

Self-attention allows every patch to interact with every other patch in a single layer. Unlike CNNs, which expand their receptive field gradually, ViTs model global relationships immediately. This means the model can directly associate distant regions—such as edges, textures, or object parts—without relying on depth alone.

LATEX
\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V

Scaled dot-product attention enables global information flow.
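
The formula translates almost line for line into code. This is a single-head sketch for clarity; a full ViT block splits the embedding into multiple heads and adds learned projections, which are omitted here:

PYTHON
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """Single-head attention; q, k, v have shape (B, N, D)."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5  # (B, N, N) pairwise similarities
    weights = F.softmax(scores, dim=-1)            # each row sums to 1
    return weights @ v                             # (B, N, D) weighted mix of values

Every output token is a weighted combination of all value vectors, so information can flow between any two patches in a single step.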

05

Learning to Classify with a [CLS] Token

For image-level classification, I adopted the [CLS] token strategy from NLP Transformers. A learnable classification token is prepended to the patch sequence and participates in self-attention. Over training, it learns to aggregate information from all patches into a single representation used by the final MLP head.

IMAGE
CLS token interaction diagram

The [CLS] token gathers global image information through attention.
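
A sketch of how the token is prepended; the parameter shape and the `expand` call follow common ViT implementations and are illustrative rather than a verbatim excerpt of my code:

PYTHON
import torch
import torch.nn as nn

class ClassToken(nn.Module):
    """Prepends a learnable [CLS] token to the patch sequence (illustrative)."""
    def __init__(self, embed_dim):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))

    def forward(self, x):                               # x: (B, N, D)
        cls = self.cls_token.expand(x.size(0), -1, -1)  # (B, 1, D), shared weights
        return torch.cat([cls, x], dim=1)               # (B, N + 1, D)

After the encoder, the final hidden state at index 0 is all the classifier sees, so attention has to route useful evidence into that single token.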

06

Training Dynamics and Stability

Training ViTs from scratch is notoriously sensitive. I observed that careful weight initialization, learning-rate warmup, and proper normalization were essential for stable convergence. Without strong inductive biases, ViTs rely more heavily on data and optimization choices, reinforcing why large-scale pretraining is critical in production systems.

Due to limited compute, I ran experiments on a reduced dataset, which made the dependence of Vision Transformers on data scale and regularization for good generalization readily apparent.

IMAGE
Training curves visualization

Optimization choices play a larger role in Transformer-based vision models.
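
To make the warmup point concrete, here is a minimal linear-warmup-plus-cosine-decay schedule of the kind I relied on; the step counts and the cosine decay are illustrative choices, not a record of my exact training configuration:

PYTHON
import math
from torch.optim.lr_scheduler import LambdaLR

def warmup_cosine(optimizer, warmup_steps, total_steps):
    """Linear warmup to the base LR, then cosine decay (illustrative)."""
    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)                     # ramp up from 0
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))          # decay toward 0
    return LambdaLR(optimizer, lr_lambda)

Without a ramp like this, the large early gradients from randomly initialized attention layers tend to destabilize training, which matches the sensitivity described above.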

07

What This Experiment Revealed

While my implementation was not designed to beat state-of-the-art CNNs, it confirmed key insights from the original paper: Vision Transformers trade inductive bias for flexibility. When data and compute are sufficient, attention-based models can match—and sometimes surpass—convolutional approaches by learning structure directly from data.

IMAGE
Model comparison illustration

Understanding behavior mattered more than benchmark scores.

08

Key Takeaways

Building a Vision Transformer from scratch clarified how architectural assumptions shape learning behavior. CNNs encode strong priors; Transformers defer structure to data. This project strengthened my intuition around attention, optimization, and why modern vision systems increasingly combine both paradigms in hybrid architectures.

IMAGE
Engineering workflow visual

From first principles to modern vision systems.