Why Question Convolutions?
Convolutional Neural Networks succeed because of strong inductive biases: locality and translation equivariance. These assumptions work well for vision, but they also limit flexibility. Vision Transformers challenge this design by stripping away almost all built-in spatial bias, keeping only the patch grid and learned positional embeddings. My goal was not to beat CNN benchmarks, but to understand what vision models learn when we stop hard-coding assumptions and let attention discover structure from data.
From Pixels to Tokens
A Vision Transformer begins by converting an image into a sequence of fixed-size patches. Instead of explicitly slicing patches with loops, I used a mathematically equivalent and more efficient approach: a Conv2D layer with kernel size and stride equal to the patch size. This single operation extracts patches and projects them into the embedding space using highly optimized GPU kernels.
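Sketched in PyTorch, the idea looks roughly like this; the class name, default image size, patch size, and embedding dimension are illustrative rather than the exact values from my runs:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping patches and project them to the embedding space.

    A Conv2d whose kernel_size and stride both equal the patch size touches each patch
    exactly once, so it is equivalent to slicing patches and applying a shared linear
    projection, but runs through optimized convolution kernels.
    """
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                        # x: (B, C, H, W)
        x = self.proj(x)                         # (B, embed_dim, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)         # (B, num_patches, embed_dim)
        return x
```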
Restoring Spatial Information
Unlike CNNs, Transformers have no built-in notion of spatial order. To compensate, I added learnable positional embeddings to each patch token. These embeddings allow the model to reason about relative spatial relationships while preserving the flexibility of attention-based processing.
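A minimal sketch of this step, again assuming PyTorch and illustrative dimensions:

```python
import torch
import torch.nn as nn

class PatchWithPosition(nn.Module):
    """Add a learnable positional embedding to each patch token."""
    def __init__(self, num_patches, embed_dim=768, dropout=0.1):
        super().__init__()
        # One learned vector per patch position, trained jointly with the rest of the model.
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
        nn.init.trunc_normal_(self.pos_embed, std=0.02)
        self.dropout = nn.Dropout(dropout)

    def forward(self, tokens):                   # tokens: (B, num_patches, embed_dim)
        return self.dropout(tokens + self.pos_embed)
```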
Global Context from the First Layer
Self-attention allows every patch to interact with every other patch in a single layer. Unlike CNNs, which expand their receptive field gradually, ViTs model global relationships immediately. This means the model can directly associate distant regions—such as edges, textures, or object parts—without relying on depth alone.
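To make the all-pairs interaction concrete, here is a compact multi-head self-attention block; the (B, heads, N, N) attention matrix in the comments is exactly the structure that relates every patch to every other patch in one layer. Naming and head counts are illustrative:

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Multi-head self-attention: every token attends to every other token."""
    def __init__(self, embed_dim=768, num_heads=12):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.qkv = nn.Linear(embed_dim, embed_dim * 3)
        self.out = nn.Linear(embed_dim, embed_dim)

    def forward(self, x):                                # x: (B, N, D)
        B, N, D = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)             # each: (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        attn = attn.softmax(dim=-1)                      # (B, heads, N, N): all-pairs weights
        x = (attn @ v).transpose(1, 2).reshape(B, N, D)
        return self.out(x)
```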
Learning to Classify with a [CLS] Token
For image-level classification, I adopted the [CLS] token strategy from NLP Transformers. A learnable classification token is prepended to the patch sequence and participates in self-attention. Over training, it learns to aggregate information from all patches into a single representation used by the final MLP head.
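Roughly, the [CLS] handling looks like the sketch below. Note that once the [CLS] token is prepended, the positional-embedding table from the earlier sketch grows by one entry to cover its position; names and sizes here are again illustrative:

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Prepend a learnable [CLS] token and classify from its final representation."""
    def __init__(self, embed_dim=768, num_classes=10):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        nn.init.trunc_normal_(self.cls_token, std=0.02)
        self.norm = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)

    def prepend_cls(self, tokens):                   # tokens: (B, N, D)
        cls = self.cls_token.expand(tokens.shape[0], -1, -1)
        return torch.cat([cls, tokens], dim=1)       # (B, N + 1, D)

    def classify(self, encoded):                     # encoded: (B, N + 1, D) after the encoder
        return self.head(self.norm(encoded[:, 0]))   # logits read from the [CLS] position
```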
Training Dynamics and Stability
Training ViTs from scratch is notoriously sensitive. I observed that careful weight initialization, learning-rate warmup, and proper normalization were essential for stable convergence. Without strong inductive biases, ViTs rely more heavily on data and optimization choices, reinforcing why large-scale pretraining is critical in production systems.
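As one concrete example, a warmup schedule can be expressed as a learning-rate multiplier and plugged into PyTorch's LambdaLR. The cosine decay and the step counts below are illustrative choices, not necessarily the exact schedule I trained with:

```python
import math
import torch

def warmup_cosine(step, warmup_steps=500, total_steps=20000):
    """LR multiplier: linear warmup, then cosine decay toward zero."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

# Usage with any optimizer, e.g. AdamW (hyperparameters illustrative):
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)
# scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_cosine)
# Call scheduler.step() once per optimization step.
```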
With limited compute, I ran the experiments on a reduced dataset, which made it clear how heavily Vision Transformers depend on data scale and regularization to generalize effectively.
What This Experiment Revealed
While my implementation was not designed to beat state-of-the-art CNNs, it confirmed key insights from the original ViT paper (Dosovitskiy et al., 2021): Vision Transformers trade inductive bias for flexibility. When data and compute are sufficient, attention-based models can match, and sometimes surpass, convolutional approaches by learning structure directly from data.
Key Takeaways
Building a Vision Transformer from scratch clarified how architectural assumptions shape learning behavior. CNNs encode strong priors; Transformers defer structure to data. This project strengthened my intuition around attention, optimization, and why modern vision systems increasingly combine both paradigms in hybrid architectures.
