SmallThinker: A Family of Efficient Large Language Models Natively Trained for Local Deployment

While frontier large language models (LLMs) continue to push capability boundaries, their deployment remains confined to GPU-powered cloud infrastructure. We challenge this paradigm with SmallThinker, a family of LLMs natively designed - not adapted - for the unique constraints of local devices: weak computational power, limited memory, and slow storage. Unlike traditional approaches that mainly compress existing models built for the cloud, we architect SmallThinker from the ground up to thrive within these limitations. Our innovation lies in a deployment-aware architecture that transforms constraints into design principles. First, we introduce a two-level sparse structure combining fine-grained Mixture-of-Experts (MoE) with sparse feed-forward networks, drastically reducing computational demands without sacrificing model capacity. Second, to conquer the I/O bottleneck of slow storage, we design a pre-attention router that enables our co-designed inference engine to prefetch expert parameters from storage while computing attention, effectively hiding the storage latency that would otherwise cripple on-device inference. Third, for memory efficiency, we employ a NoPE-RoPE hybrid sparse attention mechanism to slash KV cache requirements. We release SmallThinker-4B-A0.6B and SmallThinker-21B-A3B, which achieve state-of-the-art performance and even outperform larger LLMs. Remarkably, our co-designed system largely eliminates the need for expensive GPU hardware: with Q4_0 quantization, both models exceed 20 tokens/s on ordinary consumer CPUs while consuming only 1 GB and 8 GB of memory, respectively. SmallThinker is publicly available at hf.co/PowerInfer/SmallThinker-4BA0.6B-Instruct and hf.co/PowerInfer/SmallThinker-21BA3B-Instruct.
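To make the prefetching idea concrete, the following is a minimal sketch of how a pre-attention router can let an inference engine overlap expert-weight I/O with the attention computation. All names (route_experts, prefetch_experts, attention, sparse_ffn) and the simulated latencies are hypothetical illustrations under the abstract's description, not SmallThinker's actual implementation.

```python
import threading
import time

# Hypothetical stand-ins for the real components; names and timings are
# illustrative only, not SmallThinker's actual code.

def route_experts(hidden_states):
    """Pre-attention router: selects experts from the *input* hidden states."""
    return [hash(hidden_states) % 4]        # pretend top-1 routing over 4 experts

def prefetch_experts(expert_ids, cache):
    """Simulate reading the selected experts' weights from slow storage."""
    time.sleep(0.05)                        # stand-in for storage latency
    for eid in expert_ids:
        cache[eid] = f"weights-of-expert-{eid}"

def attention(hidden_states):
    """Simulate the attention computation that overlaps with the prefetch."""
    time.sleep(0.05)                        # stand-in for compute time
    return hidden_states + "+attn"

def sparse_ffn(hidden_states, expert_ids, cache):
    """Run only the routed experts, whose weights are now resident in memory."""
    return hidden_states + "".join(f"+{cache[eid]}" for eid in expert_ids)

def transformer_layer(hidden_states):
    cache = {}
    # 1. Route *before* attention, so expert I/O can start immediately.
    expert_ids = route_experts(hidden_states)
    # 2. Overlap the expert prefetch with the attention computation.
    io_thread = threading.Thread(target=prefetch_experts, args=(expert_ids, cache))
    io_thread.start()
    attn_out = attention(hidden_states)
    io_thread.join()
    # 3. The sparse FFN uses the prefetched experts.
    return sparse_ffn(attn_out, expert_ids, cache)

print(transformer_layer("x"))
```

Because routing happens before attention rather than after the attention output is available, as in conventional MoE layers, the storage read and the attention compute proceed concurrently, which is what hides the I/O latency of slow local storage.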