From Scratch: Building GPT-2 for Efficient Language Modeling

Re-implemented GPT-2 (124M parameters) from the ground up in PyTorch, faithfully reproducing the decoder-only Transformer architecture described in OpenAI's GPT-2 paper (Language Models are Unsupervised Multitask Learners), which builds on the Transformer introduced in Attention Is All You Need. Core components include tokenization, token and positional embeddings, multi-head causal self-attention, feed-forward layers, and residual connections. The model was trained on the FineWeb dataset and evaluated on the HellaSwag benchmark, reaching 26% accuracy after just 2 days of training on a single NVIDIA A6000 GPU, close to OpenAI's official GPT-2 (124M) score of 28.92%, which required substantially more compute and a much longer training run. This project demonstrates how careful architectural replication and an efficient training schedule can approach benchmark performance under significant resource constraints.
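
To illustrate the components listed above, here is a minimal sketch of a single GPT-2-style transformer block in PyTorch. It is not the repository's code: the module and dimension names are illustrative, and it uses torch.nn.MultiheadAttention for brevity where a from-scratch implementation would write the attention projections by hand.

```python
import torch
import torch.nn as nn


class Block(nn.Module):
    """One GPT-2-style transformer block: LayerNorm -> attention -> residual,
    then LayerNorm -> MLP -> residual (pre-norm ordering, as in GPT-2)."""

    def __init__(self, n_embd: int = 768, n_head: int = 12):
        super().__init__()
        self.ln_1 = nn.LayerNorm(n_embd)
        # Multi-head self-attention; batch_first so inputs are [batch, time, channels].
        self.attn = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
        self.ln_2 = nn.LayerNorm(n_embd)
        # Position-wise feed-forward network with the usual 4x expansion.
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        # Causal mask: True entries are positions each token is NOT allowed to attend to.
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln_1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + attn_out                  # residual connection around attention
        x = x + self.mlp(self.ln_2(x))    # residual connection around the MLP
        return x
```

A forward pass over a dummy batch, e.g. `Block()(torch.randn(2, 512, 768))`, returns a tensor of the same shape; the full model stacks 12 such blocks between the embedding layers and the output head.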

Key Features

- GPT-2 (124M) architecture implemented from scratch in PyTorch: tokenization, token and positional embeddings, multi-head causal self-attention, feed-forward layers, and residual connections
- Trained on the FineWeb dataset with a compute-efficient schedule on a single NVIDIA A6000 GPU (about 2 days)
- Evaluated on the HellaSwag benchmark, reaching 26% accuracy versus OpenAI's reported 28.92% for GPT-2 (124M)

Training Setup

| Parameter | This implementation | OpenAI's GPT-2 (124M) |
| --- | --- | --- |
| Iterations | 3,400 | 19,073 |
| Warmup steps | 132 | 715 |
| Batch size (sequences per micro-batch) | 32 | 128 |
| Total batch size (tokens) | 524,288 | 524,288 |
| Sequence length | 512 | 1,024 |
| Vocabulary size | 50,304 | 50,304 |
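
With a micro-batch of 32 sequences of 512 tokens, the total batch size of 524,288 tokens implies gradient accumulation over 524,288 / (32 × 512) = 32 micro-steps per optimizer update. The sketch below illustrates that accumulation together with a linear-warmup-plus-cosine-decay learning-rate schedule; the learning-rate values and variable names are assumptions for illustration, not the exact code in this repository.

```python
import math
import torch

# Hyperparameters from the table above (variable names are illustrative).
max_steps = 3_400
warmup_steps = 132
micro_batch_size = 32
seq_len = 512
total_batch_tokens = 524_288
grad_accum_steps = total_batch_tokens // (micro_batch_size * seq_len)  # = 32
max_lr, min_lr = 6e-4, 6e-5  # assumed values, not taken from this repository


def get_lr(step: int) -> float:
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + coeff * (max_lr - min_lr)


def train_step(model, optimizer, data_iter, step, device="cuda"):
    """One optimizer update: accumulate gradients over grad_accum_steps micro-batches."""
    optimizer.zero_grad(set_to_none=True)
    for _ in range(grad_accum_steps):
        x, y = next(data_iter)                      # x, y: [micro_batch_size, seq_len]
        x, y = x.to(device), y.to(device)
        logits = model(x)                           # [B, T, vocab_size]
        loss = torch.nn.functional.cross_entropy(
            logits.view(-1, logits.size(-1)), y.view(-1)
        )
        (loss / grad_accum_steps).backward()        # average the loss across micro-steps
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    for group in optimizer.param_groups:
        group["lr"] = get_lr(step)
    optimizer.step()
```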

Results

Result plots (Images 1–4) are included in the repository.
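
The HellaSwag accuracy reported above is commonly computed by scoring each example's four candidate endings with the model and predicting the ending with the lowest average per-token loss on the completion region. Below is a minimal sketch of that scoring step, assuming a tokenizer with an `encode` method and a model that returns logits of shape [batch, time, vocab]; both interfaces are assumptions, not this repository's API.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def pick_completion(model, tokenizer, context: str, endings: list[str], device="cuda") -> int:
    """Return the index of the ending with the lowest mean per-token loss.

    Each HellaSwag example has one context and four candidate endings; the
    prediction is the ending the model finds most likely as a continuation.
    """
    losses = []
    for ending in endings:
        ctx_ids = tokenizer.encode(context)
        end_ids = tokenizer.encode(" " + ending)
        ids = torch.tensor([ctx_ids + end_ids], device=device)   # [1, T]
        logits = model(ids)                                       # [1, T, vocab]
        # Shift so position t predicts token t+1, then score only the ending tokens.
        shift_logits = logits[:, :-1, :]
        shift_targets = ids[:, 1:]
        loss = F.cross_entropy(
            shift_logits.reshape(-1, shift_logits.size(-1)),
            shift_targets.reshape(-1),
            reduction="none",
        ).view(1, -1)
        ending_loss = loss[:, -len(end_ids):].mean()              # loss on the ending only
        losses.append(ending_loss.item())
    return int(torch.tensor(losses).argmin())
```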
GitHub Repository