Programming Language Identification at Scale

Building a production-grade Transformer model for source-code classification, adopted by hundreds of thousands of developers.

2024-02-15 · 10 min read
NLP · Transformers · Production ML · Open Source
01

Why Language Identification Is Harder Than It Looks

Automatically identifying the programming language of a code snippet seems trivial—until you encounter real-world code. Short snippets, mixed syntax, shared keywords (e.g., `class`, `def`, `{}`), and configuration files blur the boundaries between languages. Existing tools often rely on heuristics or file extensions, which fail in notebooks, chat interfaces, and pasted code blocks.
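To make the shared-keyword problem concrete, here is a toy rule-based matcher (the keyword sets and the `candidate_languages` helper are hypothetical, for illustration only, not any real tool's logic). A short snippet built from widely shared keywords matches several languages at once:

```python
# Toy keyword-based detector, illustrating why shared keywords
# make rule-based language detection ambiguous on short snippets.
KEYWORDS = {
    "Python": {"def", "class", "import"},
    "Ruby": {"def", "class", "end"},
    "Java": {"class", "import", "public"},
}

def candidate_languages(snippet: str) -> list[str]:
    """Return every language whose keyword set overlaps the snippet's tokens."""
    tokens = set(snippet.split())
    return sorted(lang for lang, kws in KEYWORDS.items() if tokens & kws)

# A snippet using only shared keywords matches all three languages:
candidate_languages("class Greeter")  # ambiguous: Java, Python, and Ruby all match
```

A learned classifier sidesteps this by weighing many weak token-level signals jointly instead of matching keywords one at a time.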

IMAGE
Abstract code visualization

Short, context-free code snippets break rule-based language detection systems.

02

Treating Code as a Language Modeling Problem

I framed programming language identification as a sequence classification task by treating source code as structured text. Instead of using hand-crafted rules or file-based heuristics, I fine-tuned a Transformer-based CodeBERTa-small-v1 model to learn language-specific syntax and token patterns directly from data. Since the base model was already pretrained on languages like Python, PHP, and Ruby, it generalized well to short snippets, partial code blocks, and inconsistently formatted real-world code.

LATEX
\hat{y} = \arg\max_{c \in C} \; P(c \mid x)

Language detection as a multi-class sequence classification problem.
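Concretely, the decision rule above takes a softmax over the classifier's logits and picks the highest-probability class. A pure-Python sketch, with hypothetical logits standing in for real model outputs:

```python
import math

# Hypothetical logits for four candidate languages (illustrative values only)
languages = ["Python", "JavaScript", "Go", "Rust"]
logits = [2.1, 0.3, -0.5, 0.9]

# softmax turns logits into P(c | x); argmax picks the predicted language
exps = [math.exp(z) for z in logits]
probs = [e / sum(exps) for e in exps]
predicted = languages[max(range(len(probs)), key=probs.__getitem__)]
```

The same argmax-over-logits step appears at the end of the real inference code in section 05.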

03

Curating a Multi-Language Code Dataset

The model was trained on a diverse corpus (the Rosetta Code dataset) spanning 100+ programming languages, including Python, JavaScript, C++, Java, Go, Rust, and more. Special care was taken to balance languages and include short, noisy snippets, reflecting how code actually appears in developer tools, chats, and documentation.

PYTHON
# Example preprocessing step: keep only well-represented languages
from collections import Counter

label_counts = Counter(new_dataset['language_name'])
threshold = 500

# Languages with more than `threshold` snippets survive the filter
labels_to_keep = [label for label, count in label_counts.items() if count > threshold]

new_dataset = new_dataset.filter(lambda example: example['language_name'] in labels_to_keep)
new_dataset

Keeping only the languages with more than 500 snippets.
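Once the long tail of rare languages is filtered out, the surviving labels still need stable integer class IDs for the classification head; the inference code in section 05 reads these back via `config.id2label`. A minimal sketch of building such mappings, using a toy label list rather than the real filtered corpus:

```python
# Sketch: derive the integer label mappings a sequence classifier needs.
# `labels_to_keep` here is a toy list, standing in for the filtered labels.
labels_to_keep = ["Python", "JavaScript", "Rust"]

label2id = {label: i for i, label in enumerate(sorted(labels_to_keep))}
id2label = {i: label for label, i in label2id.items()}
```

Sorting before enumerating keeps the ID assignment deterministic across runs, which matters when the mappings are baked into a published model config.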

04

Model Architecture and Training

I fine-tuned a Transformer encoder (~83.5M parameters) for multi-class programming language classification. Instead of relying on explicit grammar rules or file extensions, the model learns language identity from token-level patterns and structural cues present in source code. Training emphasized robustness to real-world inputs, including short snippets, incomplete syntax, and overlapping keywords shared across languages.
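Fine-tuning a classification head of this kind typically minimizes multi-class cross-entropy over the language labels. A toy calculation (the logits are hypothetical, not real model outputs) shows the per-example loss:

```python
import math

# Toy computation of the multi-class cross-entropy loss used when
# fine-tuning a sequence classification head.
logits = [0.2, 1.5, -0.3]   # hypothetical scores for three candidate languages
true_class = 1              # index of the snippet's actual language

# loss = -log P(true | x) = log(sum_c exp(z_c)) - z_true
log_sum_exp = math.log(sum(math.exp(z) for z in logits))
loss = log_sum_exp - logits[true_class]
```

Driving this loss down pushes probability mass onto the correct language even when snippets are short or syntactically incomplete.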

IMAGE
Neural network abstraction

Transformer encoders excel at capturing long-range token dependencies.

05

Inference Optimization for Real-World Usage

To make the model easy to use in real applications, I uploaded it to Hugging Face’s model repository so it can be loaded and run with just a few lines of code.

The model has two versions: a standard PyTorch version for development and experimentation, and a pull-request-approved ONNX version for faster, more efficient inference in production systems.

PYTHON
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = 'philomath-1209/programming-language-identification'
loaded_tokenizer = AutoTokenizer.from_pretrained(model_name)
loaded_model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Run on GPU when available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
loaded_model.to(device)

text = """
PROGRAM Triangle
   IMPLICIT NONE
   REAL :: a, b, c, Area
   PRINT *, 'Welcome, please enter the&
            &lengths of the 3 sides.'
   READ *, a, b, c
   PRINT *, 'Triangle''s area:  ', Area(a,b,c)
  END PROGRAM Triangle
  FUNCTION Area(x,y,z)
   IMPLICIT NONE
   REAL :: Area            ! function type
   REAL, INTENT( IN ) :: x, y, z
   REAL :: theta, height
   theta = ACOS((x**2+y**2-z**2)/(2.0*x*y))
   height = x*SIN(theta); Area = 0.5*y*height
  END FUNCTION Area

"""
inputs = loaded_tokenizer(text, return_tensors="pt", truncation=True).to(device)
with torch.no_grad():
    logits = loaded_model(**inputs).logits

predicted_class_id = logits.argmax().item()
loaded_model.config.id2label[predicted_class_id]

Inference with the Transformers AutoModel API.

06

Adoption and Open-Source Impact

After releasing the model publicly on Hugging Face, it gained rapid adoption and is now used by 350k+ developers worldwide. The project demonstrated how a focused, well-scoped ML system can deliver immediate real-world value when paired with strong documentation and easy deployment.

IMAGE
Open source collaboration

HF model page for programming language identification model.

07

Constraints and Trade-offs

While the architecture could scale further, training was bounded by practical compute limits. This reinforced a key lesson: for classification tasks, data quality and coverage often matter more than pushing parameter count, especially when optimizing for inference efficiency.

LATEX
\text{Performance} \approx f(\text{data quality}, \text{task fit})

Scaling laws matter—but only when aligned with the problem.

08

Key Learnings

This project strengthened my understanding of framing ML problems correctly, designing for production constraints, and shipping models that developers can actually use. It also reinforced the importance of open-source as a force multiplier for real-world impact.

IMAGE
Developer collaboration

Generic image of developers working on code.