Re-implemented GPT-2 (124M parameters) from the ground up in PyTorch, faithfully reproducing the decoder-only transformer architecture described in OpenAI's GPT-2 paper (Language Models are Unsupervised Multitask Learners), which builds on the transformer introduced in Attention Is All You Need. Core components include tokenization, token and positional embeddings, causal (masked) multi-head self-attention, feed-forward layers, and residual connections. The model was trained on the FineWeb dataset and evaluated on the HellaSwag benchmark, reaching 26% accuracy after just 2 days of training on a single NVIDIA A6000 GPU, close to OpenAI's official GPT-2 score of 28.92%, which was obtained with far more compute over a much longer training run. This project shows that a faithful architectural replication paired with a carefully tuned training schedule can approach benchmark performance under tight resource constraints.
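
A minimal sketch of what one such transformer block looks like in PyTorch is shown below; the class and variable names are illustrative and not necessarily those used in this repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """Multi-head self-attention with a causal mask, GPT-2 style."""
    def __init__(self, n_embd: int, n_head: int):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head = n_head
        self.c_attn = nn.Linear(n_embd, 3 * n_embd)  # fused query/key/value projection
        self.c_proj = nn.Linear(n_embd, n_embd)      # output projection

    def forward(self, x):
        B, T, C = x.size()
        q, k, v = self.c_attn(x).split(C, dim=2)
        # reshape to (B, n_head, T, head_dim) for per-head attention
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        # causal (masked) scaled dot-product attention
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.c_proj(y)

class Block(nn.Module):
    """Pre-norm GPT-2 block: attention and feed-forward, each in a residual connection."""
    def __init__(self, n_embd: int, n_head: int):
        super().__init__()
        self.ln_1 = nn.LayerNorm(n_embd)
        self.attn = CausalSelfAttention(n_embd, n_head)
        self.ln_2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(            # position-wise feed-forward layer
            nn.Linear(n_embd, 4 * n_embd),
            nn.GELU(approximate='tanh'),     # GPT-2 uses the tanh GELU approximation
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))  # residual connection around attention
        x = x + self.mlp(self.ln_2(x))   # residual connection around the feed-forward layer
        return x
```

The training hyperparameters compared against OpenAI's 124M configuration: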
| Parameter | This implementation | OpenAI GPT-2 (124M) |
|---|---|---|
| Iterations | 3,400 | 19,073 |
| Warmup steps | 132 | 715 |
| Batch size (sequences) | 32 | 128 |
| Total batch size (tokens) | 524,288 | 524,288 |
| Sequence length | 512 | 1,024 |
| Vocabulary size | 50,304 | 50,304 |
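
The settings above imply gradient accumulation (a 32 × 512 micro-batch stepped up to 524,288 tokens per optimizer update) and a linear warmup over the first 132 steps. The sketch below shows how these numbers fit together; the peak/minimum learning rates and the cosine decay are assumptions borrowed from common GPT-2/GPT-3-style training recipes, not values confirmed by this table.

```python
import math

# hyperparameters from the table above
total_batch_size = 524_288   # tokens per optimizer step
batch_size = 32              # sequences per forward/backward pass
seq_len = 512                # tokens per sequence
max_steps = 3_400
warmup_steps = 132

# gradient accumulation bridges the per-pass batch and the total batch
assert total_batch_size % (batch_size * seq_len) == 0
grad_accum_steps = total_batch_size // (batch_size * seq_len)  # = 32 micro-batches per step

# assumed learning-rate range; the warmup-then-cosine shape is a common GPT-2/GPT-3 recipe
max_lr = 6e-4
min_lr = max_lr * 0.1

def get_lr(step: int) -> float:
    """Linear warmup for `warmup_steps`, then cosine decay to `min_lr` by `max_steps`."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    decay_ratio = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return min_lr + coeff * (max_lr - min_lr)
```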