MicroGPT: A 200-Line Python Implementation of GPT from Scratch with Autograd, Training, and Inference
microgpt is a single-file, 200-line Python script that implements a complete, functional GPT-like language model from scratch, with no external dependencies. It contains everything needed to train and generate text: a dataset, a tokenizer, an autograd engine, a GPT-2-inspired neural network, the Adam optimizer, a training loop, and an inference loop. The entire system is built on pure Python and scalar operations, making it a minimal yet fully working demonstration of how large language models work at their core.

The project uses a simple dataset of 32,000 names, one per line. Each name is treated as a document, and the model learns statistical patterns in sequences of characters. The goal is to generate new, plausible-sounding names by predicting the next character at each step.

The tokenizer maps each unique character (a–z) and a special BOS (Beginning of Sequence) token to an integer ID, creating a vocabulary of 27 tokens. Text is converted into sequences of IDs, and the model learns to predict the next ID in the sequence.

At the heart of microgpt is a custom autograd system implemented in a Value class. This class tracks scalar values and their gradients through operations like addition, multiplication, exponentiation, and activation functions such as ReLU. The backward() method computes gradients using the chain rule by traversing the computation graph in reverse topological order. This is the same algorithm PyTorch uses, applied to individual scalars instead of tensors, which makes it simpler and more transparent.

Model parameters are initialized as random values and organized into a dictionary (state_dict) that holds embedding tables, attention weights, MLP weights, and output projections. The architecture follows a simplified GPT-2 design: position embeddings, multi-head self-attention with RMSNorm, residual connections, and a two-layer MLP per block. The model processes one token at a time, maintaining a KV cache of past keys and values so that attention can look back over previous positions.

During training, the model processes each name in sequence, computing the loss as the negative log probability of the correct next token, averaged across all positions in the sequence. Backpropagation computes gradients for every parameter, and the Adam optimizer updates them using momentum and adaptive learning rates. Over 1,000 steps, the loss drops from around 3.3 (the random-guessing baseline, since ln 27 ≈ 3.3) to about 2.37, showing that the model has learned meaningful patterns.

After training, inference begins by sampling from the model's output distribution. Starting with BOS, the model predicts the next token based on its current state and the KV cache. The sampled token is fed back in as input, and the process repeats until BOS is generated again or the sequence reaches the maximum length. A temperature parameter controls randomness: lower values make the model more conservative, higher values increase diversity.

The script is designed to be educational and modular. It was built incrementally through a series of versions, starting from a bigram model, then adding MLPs, autograd, attention, and finally the full GPT architecture. A GitHub Gist tracks these steps, allowing readers to see how the code evolves. While microgpt is tiny and slow compared to production models like ChatGPT, it captures the essential algorithmic core of modern LLMs. The sketches below walk through the main pieces (tokenizer, autograd, attention, loss, optimizer, and sampling) in simplified form.
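The tokenizer amounts to a pair of lookup tables plus BOS handling. Here is a minimal sketch of that idea; the names `BOS`, `encode`, and `decode` are illustrative, not necessarily the script's own identifiers:

```python
chars = "abcdefghijklmnopqrstuvwxyz"
BOS = 0                                            # special Beginning-of-Sequence token
stoi = {ch: i + 1 for i, ch in enumerate(chars)}   # 'a' -> 1, ..., 'z' -> 26
itos = {i: ch for ch, i in stoi.items()}
vocab_size = len(stoi) + 1                         # 27 tokens: a-z plus BOS

def encode(name: str) -> list[int]:
    # wrap each name with BOS so the model learns where names start and end
    return [BOS] + [stoi[ch] for ch in name] + [BOS]

def decode(ids: list[int]) -> str:
    return "".join(itos[i] for i in ids if i != BOS)

print(encode("emma"))   # [0, 5, 13, 13, 1, 0]
```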
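The autograd engine works in the spirit of the condensed Value class below. This sketch implements only addition and multiplication; the full script also covers exponentiation, powers, ReLU, and the other operations the model needs:

```python
class Value:
    """A scalar that records the operations that produced it, so gradients
    can flow backward through the resulting computation graph."""
    def __init__(self, data, children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None   # local chain-rule step for this node
        self._prev = set(children)

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad       # d(a+b)/da = 1
            other.grad += out.grad      # d(a+b)/db = 1
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad   # d(a*b)/da = b
            other.grad += self.data * out.grad   # d(a*b)/db = a
        out._backward = _backward
        return out

    def backward(self):
        # build reverse topological order: children before parents
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0                 # seed: d(out)/d(out) = 1
        for node in reversed(topo):
            node._backward()

# tiny check: loss = a*b + a, so dloss/da = b + 1 and dloss/db = a
a, b = Value(2.0), Value(-3.0)
loss = a * b + a
loss.backward()
print(a.grad, b.grad)   # -2.0, 2.0
```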
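For a single head, attention over a KV cache reduces to the float-level sketch below. microgpt performs the same computation on Value scalars so gradients can flow; `attention_step` and `kv_cache` are illustrative names, not the script's API:

```python
import math

def attention_step(q, new_k, new_v, kv_cache):
    """One single-head attention step over a growing KV cache.
    q, new_k, new_v are plain lists of floats in this sketch."""
    keys, values = kv_cache
    keys.append(new_k)                  # remember this position's key/value
    values.append(new_v)
    d = len(q)
    # scaled dot-product score of the query against every cached key
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)                     # subtract the max for a stable softmax
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # output is the attention-weighted sum of all cached values
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(d)]

# usage: the cache persists across token positions, so each new token
# attends over everything generated so far without recomputing old keys
cache = ([], [])
out1 = attention_step([1.0, 0.0], [1.0, 0.0], [0.5, 0.5], cache)
out2 = attention_step([0.0, 1.0], [0.0, 1.0], [1.0, 0.0], cache)
```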
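The per-name training loss described above is ordinary cross-entropy: the average negative log probability the model assigns to each correct next token. A float-level sketch, with an illustrative function name:

```python
import math

def sequence_loss(logits_per_pos, targets):
    """Average negative log-likelihood of the correct next token.
    logits_per_pos: one list of vocab-size logits per position."""
    total = 0.0
    for logits, target in zip(logits_per_pos, targets):
        m = max(logits)                                   # numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[target]                   # -log softmax[target]
    return total / len(targets)

# uniform logits over 27 tokens give ln(27) ~ 3.3, the quoted starting loss
print(sequence_loss([[0.0] * 27], [0]))   # ~3.2958
```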
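One Adam update looks roughly like this; the hyperparameter values shown are the usual published defaults, not necessarily the ones microgpt picks:

```python
def adam_step(params, grads, m, v, t,
              lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update over flat lists of floats; t counts steps from 1."""
    for i, (p, g) in enumerate(zip(params, grads)):
        m[i] = beta1 * m[i] + (1 - beta1) * g        # momentum (1st moment)
        v[i] = beta2 * v[i] + (1 - beta2) * g * g    # adaptive scale (2nd moment)
        m_hat = m[i] / (1 - beta1 ** t)              # bias-correct the
        v_hat = v[i] / (1 - beta2 ** t)              # zero-initialized moments
        params[i] = p - lr * m_hat / (v_hat ** 0.5 + eps)
```

The division by the second-moment estimate is what gives each parameter its own effective learning rate, which is why Adam trains small models like this far faster than plain gradient descent.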
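Finally, temperature sampling simply divides the logits before the softmax. A minimal sketch (`sample_token` is an illustrative name):

```python
import math, random

def sample_token(logits, temperature=1.0):
    """Sample a token id from logits; lower temperature -> more conservative."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                          # subtract max for a stable softmax
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(range(len(probs)), weights=probs)[0]
```

As temperature approaches 0 this converges to greedy argmax decoding; above 1, the distribution flattens and rarer characters appear more often.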
Real-world systems scale this up with massive datasets (trillions of tokens), subword tokenizers (such as BPE), GPU-accelerated tensor operations, far larger models (hundreds of billions of parameters), batched training, and post-training via fine-tuning and reinforcement learning. Despite the differences in scale and engineering, the fundamental process remains the same: predict the next token in a sequence based on past context, using a neural network trained to minimize prediction error. microgpt shows that the entire idea of a language model can be distilled into roughly 200 lines of code, making it a powerful tool for understanding how AI generates text.
