Microsoft Open Sources BitNet: Official Inference Framework for 1-Bit LLMs
Microsoft has officially released bitnet.cpp, the dedicated inference framework for 1-bit Large Language Models (LLMs), developed in collaboration with the BitNet research team. The open-source project, hosted on GitHub, enables fast and lossless inference of 1.58-bit models on both CPU and GPU hardware, with Neural Processing Unit (NPU) support planned for a future release. The framework uses a suite of optimized kernels built on the llama.cpp infrastructure and incorporates the lookup-table methods pioneered by T-MAC to maximize efficiency.
The initial release focuses on CPU performance. On ARM CPUs, reported speedups range from 1.37x to 5.07x, with larger models seeing the largest gains, and energy consumption falls by 55.4% to 70.0%, which substantially extends battery life on edge devices. On x86 CPUs the gains are larger still: speedups of 2.37x to 6.17x and energy reductions of 71.9% to 82.2%.
A standout result is the framework's ability to run a 100-billion-parameter BitNet b1.58 model on a single CPU at 5 to 7 tokens per second. That rate is comparable to human reading speed and marks a major step toward running large-scale language models locally on consumer hardware.
Recent updates have introduced parallel kernel implementations with configurable tiling, along with embedding quantization support. These optimizations provide an additional 1.15x to 2.1x speedup over the original implementation across various hardware platforms and workloads.
The project supports a range of existing 1-bit models available on Hugging Face, including the BitNet b1.58 series (0.7B to 3.3B parameters), Llama3 variants trained with 1.58-bit precision, and several models from the Falcon3 and Falcon-E families.
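To make the "1.58-bit" label and the single-CPU claim more concrete, here is a minimal Python sketch. It assumes the absmean ternary quantization scheme described for BitNet b1.58, in which each weight becomes -1, 0, or +1 with one shared scale, and it estimates an idealized weight footprint. The function names are hypothetical, this is not code from bitnet.cpp, and real packed formats typically round up to a whole number of bits per weight, so actual files are larger than the ideal figure.

```python
import math

def absmean_ternary_quantize(weights, eps=1e-8):
    """Map floats to {-1, 0, +1} using the mean absolute value as the scale."""
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    # Round to the nearest integer, then clip into the ternary range.
    ternary = [max(-1, min(1, round(w / scale))) for w in weights]
    return ternary, scale

def dequantize(ternary, scale):
    """Approximate reconstruction of the original weights."""
    return [t * scale for t in ternary]

def ideal_weight_gigabytes(num_params, bits_per_weight):
    """Idealized storage for the weights alone, in decimal gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9

# Three possible values carry log2(3) bits of information, about 1.585.
BITS_PER_TERNARY_WEIGHT = math.log2(3)

weights = [0.9, -0.05, -1.3, 0.4, 0.02, -0.6]
ternary, scale = absmean_ternary_quantize(weights)
print(ternary)  # [1, 0, -1, 1, 0, -1]

# Idealized footprint of a 100-billion-parameter model.
print(ideal_weight_gigabytes(100e9, 16))                       # FP16: 200 GB
print(ideal_weight_gigabytes(100e9, BITS_PER_TERNARY_WEIGHT))  # ternary: just under 20 GB
```

Under these assumptions, the ternary weights of a 100B model need roughly a tenth of the memory of an FP16 copy, which is what brings such a model within reach of a single machine's RAM.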
The initial CPU release arrived in October 2024, and official GPU inference kernels followed in May 2025; even so, the framework remains optimized primarily for CPU deployment. The authors hope bitnet.cpp will inspire further development of 1-bit LLMs, both in model size and in training tokens.
Users can build the project from source, install its dependencies, and run benchmarks with the provided Python scripts. The software exposes configuration options for quantization type, thread count, context size, and chat mode. Technical documentation and optimization guides are available for developers who want to understand the underlying mechanics or contribute to the project. The project acknowledges the authors of llama.cpp and the T-MAC team for their foundational work, and it aims to make advanced AI capabilities more accessible and energy-efficient for the broader community.
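For readers curious about the lookup-table approach credited to T-MAC, the toy Python sketch below illustrates the general idea only: with ternary weights, the partial dot product of a small activation group with every possible weight pattern can be precomputed once and then reused for every row of the weight matrix, so each row needs only lookups and additions. The group size, data layout, and all function names here are illustrative assumptions, not the actual T-MAC or bitnet.cpp kernels.

```python
from itertools import product

GROUP = 2  # weights covered by one lookup index (an illustrative choice)
PATTERNS = list(product((-1, 0, 1), repeat=GROUP))  # all 3**GROUP ternary patterns
PATTERN_INDEX = {p: i for i, p in enumerate(PATTERNS)}

def encode_row(ternary_row):
    """Replace each group of ternary weights with its pattern index (done offline)."""
    assert len(ternary_row) % GROUP == 0
    return [PATTERN_INDEX[tuple(ternary_row[i:i + GROUP])]
            for i in range(0, len(ternary_row), GROUP)]

def build_tables(activations):
    """For each activation group, precompute its dot product with every pattern."""
    tables = []
    for i in range(0, len(activations), GROUP):
        chunk = activations[i:i + GROUP]
        tables.append([sum(w * a for w, a in zip(p, chunk)) for p in PATTERNS])
    return tables

def lut_matvec(encoded_rows, activations):
    """Matrix-vector product using table lookups instead of multiplications."""
    tables = build_tables(activations)  # built once, shared by all rows
    return [sum(tables[g][idx] for g, idx in enumerate(row))
            for row in encoded_rows]

weight_rows = [[1, 0, -1, 1], [0, 0, 1, -1], [-1, -1, 1, 1]]
activations = [2.0, -3.0, 0.5, 4.0]

encoded = [encode_row(r) for r in weight_rows]
result = lut_matvec(encoded, activations)

# Plain multiply-accumulate reference for comparison.
reference = [sum(w * a for w, a in zip(r, activations)) for r in weight_rows]
print(result)     # [5.5, -3.5, 5.5]
print(reference)  # [5.5, -3.5, 5.5]
```

The table-building cost is paid once per activation vector, so the saving grows with the number of weight rows that reuse it. Production kernels would additionally pack the indices tightly and use SIMD instructions, which this sketch does not attempt.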
