# nanoVLM: Simplify Your Vision Language Model Training with Pure PyTorch
nanoVLM is an accessible, straightforward repository for training your own Vision Language Models (VLMs) in pure PyTorch. Inspired by Andrej Karpathy's nanoGPT, nanoVLM strips the complex world of multi-modal models down to its essentials, making it ideal for beginners and anyone curious about the mechanics of VLMs without overwhelming technical detail.

## What is a Vision Language Model?

A Vision Language Model (VLM) combines image and text processing capabilities. These models take images and text as input and generate text as output, enabling applications such as image captioning, object detection, and visual question answering. nanoVLM focuses specifically on Visual Question Answering (VQA), where the model answers text-based questions about images.

## Working with the Repository

The nanoVLM repository is organized into two main directories plus the primary training script:

- **data/**: Contains `collators.py`, `datasets.py`, and `processors.py` for managing and preparing the data.
- **models/**: Houses the core model files, including `vision_transformer.py` (the vision backbone), `language_model.py` (the language backbone), and `vision_language_model.py` (the main VLM class).
- **train.py**: The entry point for training your VLM. It handles configuration, data loading, model initialization, optimizer setup, the training loop, logging, and model saving.

## Architecture Overview

nanoVLM builds on two well-known architectures for its vision and language backbones:

- **Vision backbone**: Google's SigLIP vision encoder, implemented in `vision_transformer.py`.
- **Language backbone**: Follows the Llama 3 architecture, implemented in `language_model.py`.
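The two backbones are bridged by a modality projection, a pixel shuffle followed by a linear layer. A minimal sketch of that idea in PyTorch is shown below; the class name, dimensions, and shuffle factor are illustrative assumptions, not nanoVLM's exact code:

```python
import torch
import torch.nn as nn

class ModalityProjector(nn.Module):
    """Illustrative sketch: pixel shuffle to cut the image-token count,
    then a linear layer mapping vision embeddings into the LM embedding space.
    Dimensions (768 -> 576) and scale factor are assumptions for this example."""
    def __init__(self, vit_dim=768, lm_dim=576, scale=2):
        super().__init__()
        self.scale = scale  # groups scale*scale neighboring patches into one token
        self.proj = nn.Linear(vit_dim * scale * scale, lm_dim)

    def forward(self, x):  # x: (batch, num_patches, vit_dim)
        b, n, d = x.shape
        s = self.scale
        side = int(n ** 0.5)  # patches assumed to form a square grid
        x = x.view(b, side, side, d)
        x = x.view(b, side // s, s, side // s, s, d)
        # Concatenate each s*s block of patch embeddings into one token
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, (side // s) ** 2, s * s * d)
        return self.proj(x)  # (batch, num_patches / s^2, lm_dim)

x = torch.randn(1, 64, 768)        # e.g. an 8x8 grid of patch embeddings
out = ModalityProjector()(x)
print(out.shape)                   # torch.Size([1, 16, 576])
```

With a scale of 2, the 64 image tokens collapse to 16, which is exactly how the pixel shuffle lowers the cost of feeding images into the language decoder.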
The two modalities are aligned by a **Modality Projection** module, which transforms image embeddings into a format compatible with the text embeddings. It consists of a pixel shuffle operation followed by a linear layer; the pixel shuffle reduces the number of image tokens, lowering computational cost. The combined embeddings are then fed into the language decoder.

## Training Your VLM

To start training your VLM, simply run:

```bash
python train.py
```

### Configuration

The training script begins by loading configuration classes from `models/config.py`. These classes define the hyperparameters and settings essential for the training process.

### Data Loading

The data pipeline is managed by the `get_dataloaders` function. It prepares the dataset and can be configured with a `data_cutoff_idx` for debugging on smaller subsets.

### Model Initialization

The `VisionLanguageModel` class builds the model. If you are resuming from a checkpoint, you can initialize it with:

```python
from models.vision_language_model import VisionLanguageModel

model = VisionLanguageModel.from_pretrained(model_path)
```

Alternatively, you can initialize a fresh model, optionally preloading pre-trained backbones for vision and language.

### Optimizer Setup

The optimizer is configured with two learning rates because the modality projector (MP) starts from scratch while the backbones are pre-trained: a higher rate lets the MP learn quickly, while a lower rate preserves the knowledge already stored in the vision and language models.

### Training Loop

The training loop evaluates on the validation set and the MMStar test set every 250 steps, checkpointing the model whenever performance improves. Logging and monitoring are handled by Weights & Biases, tracking metrics such as batch loss, validation loss, accuracy, and tokens per second. Runs are automatically named from relevant metadata: sample size, batch size, epoch count, learning rates, and the current date.
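The two-learning-rate setup can be sketched with standard PyTorch parameter groups. The modules and learning-rate values below are illustrative assumptions, not nanoVLM's actual configuration:

```python
import torch
import torch.nn as nn

# Illustrative stand-ins for the three components; in nanoVLM these would be
# the modality projector and the pre-trained vision/language backbones.
mp = nn.Linear(3072, 576)        # freshly initialized -> should learn fast
vision = nn.Linear(768, 768)     # pre-trained -> small, careful updates
language = nn.Linear(576, 576)   # pre-trained -> small, careful updates

# Two parameter groups with different learning rates (values are assumptions).
optimizer = torch.optim.AdamW([
    {"params": mp.parameters(), "lr": 2e-3},   # high LR for the modality projector
    {"params": list(vision.parameters()) + list(language.parameters()),
     "lr": 1e-4},                              # low LR for the backbones
])

print([g["lr"] for g in optimizer.param_groups])  # [0.002, 0.0001]
```

Each group keeps its own learning rate throughout training, so a single `optimizer.step()` updates the projector aggressively while nudging the backbones gently.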
## Pushing to the Hub

After training, you can save your model and push it to the Hugging Face Hub for others to use and test:

```python
model.save_pretrained(save_path)
model.push_to_hub("hub/id")
```

## Running Inference on a Pre-Trained Model

Once your model is trained, use the `generate.py` script to perform inference:

```bash
python generate.py --image path/to/image.png --prompt "Your prompt here"
```

The script initializes the model, sets it to evaluation mode, processes the input image and text prompt, and generates the output text. For example, asking "What is this?" about an image of two cats lying on a bed:

> **Input:** What is this?
>
> **Output:** In the picture, I can see the pink color bed sheet. I can see two cats lying on the bed sheet.

For a user-friendly interface, the authors have created a Hugging Face Space where you can test the model interactively.

## Conclusion

nanoVLM provides a lightweight, readable codebase for building and training VLMs, with a particular focus on Visual Question Answering. Its simplicity makes it an excellent educational tool for understanding how multi-modal inputs are aligned, and a solid foundation for training VLMs on custom datasets. Whether you're a beginner or an experienced developer, nanoVLM offers a streamlined way to explore and enhance multi-modal AI.

## Industry Insights and Company Profiles

Industry experts have praised nanoVLM for its accessibility and educational value: it lowers the barrier to entry for training VLMs and encourages more developers to experiment with multi-modal models. Hugging Face, known for advancing the state of natural language processing and machine learning, continues that work by providing tools like nanoVLM that simplify complex tasks. The repository's reliance on established architectures such as SigLIP and Llama 3, coupled with its modular design, allows for easy customization and scaling.
This approach not only accelerates learning but also fosters a community of developers who can contribute to and build upon the foundation it provides.
