Minbpe Repository
This is Andrej Karpathy's minbpe repository.
There are two Tokenizers in this repository, both of which can perform the 3 main functions of a Tokenizer:
- Train the tokenizer vocabulary and merges on a given text
- Encode text into tokens
- Decode tokens back into text
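The three functions above can be sketched in a few lines of byte-level BPE. This is a minimal illustration and not the code from minbpe: the function names `merge`, `train`, `encode`, and `decode` are chosen here for the example, and `encode` simply replays the merges in the order they were learned.

```python
from collections import Counter


def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train(text, vocab_size):
    """Learn merges (pair -> new token id), starting from the 256 byte values."""
    ids = list(text.encode("utf-8"))
    merges = {}
    for new_id in range(256, vocab_size):
        counts = Counter(zip(ids, ids[1:]))
        if not counts:
            break  # nothing left to merge
        pair = counts.most_common(1)[0][0]  # most frequent adjacent pair
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return merges


def encode(text, merges):
    """Turn text into token ids by replaying the merges in training order."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():  # dicts keep insertion order
        ids = merge(ids, pair, new_id)
    return ids


def decode(ids, merges):
    """Turn token ids back into text by expanding each id to its bytes."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in merges.items():
        vocab[new_id] = vocab[a] + vocab[b]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")
```

Calling `train("aaabdaaabac", 259)` learns three merges on top of the 256 byte tokens, and `decode(encode(s, merges), merges)` returns `s` for any string, because unmerged bytes are always kept as single-byte tokens.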
The goal of the minbpe project is to provide minimal, clean, and educational code for the byte pair encoding (BPE) algorithm commonly used in LLM tokenization. Its two Tokenizers cover the core functions of tokenizer training, encoding, and decoding. Keeping the implementation this small makes the code easy to read and easy to use directly.
Specifically, the repository contains class-based implementations. The base Tokenizer class defines the common interface for training, encoding, and decoding, and also provides practical functions such as saving and loading. BasicTokenizer is the simplest implementation and runs BPE directly on the text. RegexTokenizer additionally splits the input text with a regex pattern before merging, and GPT4Tokenizer is a light wrapper around RegexTokenizer that reproduces the GPT-4 tokenization.
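One way to picture this class-based layout is a base class that holds the shared state and the saving and loading logic, with subclasses filling in the rest. The sketch below is hypothetical: the class names `SketchTokenizer` and `DecodeOnlyTokenizer` and the JSON file format are assumptions made for illustration, not the classes or the file format used by minbpe.

```python
import json


class SketchTokenizer:
    """Hypothetical base class: shared state and persistence live here."""

    def __init__(self):
        self.merges = {}  # (int, int) -> int, filled in by train() or load()

    # Subclasses are expected to override these three methods.
    def train(self, text, vocab_size):
        raise NotImplementedError

    def encode(self, text):
        raise NotImplementedError

    def decode(self, ids):
        raise NotImplementedError

    def _build_vocab(self):
        """Rebuild token id -> bytes from the merges, in the order learned."""
        vocab = {i: bytes([i]) for i in range(256)}
        for (a, b), new_id in self.merges.items():
            vocab[new_id] = vocab[a] + vocab[b]
        return vocab

    def save(self, path):
        # JSON object keys must be strings, so store [a, b, new_id] triples.
        with open(path, "w", encoding="utf-8") as f:
            json.dump([[a, b, n] for (a, b), n in self.merges.items()], f)

    def load(self, path):
        with open(path, encoding="utf-8") as f:
            self.merges = {(a, b): n for a, b, n in json.load(f)}


class DecodeOnlyTokenizer(SketchTokenizer):
    """Hypothetical subclass that overrides only decode, to keep the sketch short."""

    def decode(self, ids):
        vocab = self._build_vocab()
        return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")
```

Because `save` and `load` sit in the base class, every subclass can persist and restore its merges without repeating that code, and each subclass only has to decide how it trains, encodes, and decodes.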