SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion

Abstract
We introduce SmolDocling, an ultra-compact vision-language model targeting end-to-end document conversion. Our model comprehensively processes entire pages by generating DocTags, a new universal markup format that captures all page elements in their full context with location. Unlike existing approaches that rely on large foundational models, or ensemble solutions that rely on handcrafted pipelines of multiple specialized models, SmolDocling offers an end-to-end conversion for accurately capturing content, structure and spatial location of document elements in a 256M-parameter vision-language model. SmolDocling exhibits robust performance in correctly reproducing document features such as code listings, tables, equations, charts, lists, and more across a diverse range of document types including business documents, academic papers, technical reports, patents, and forms -- significantly extending beyond the commonly observed focus on scientific papers. Additionally, we contribute novel publicly sourced datasets for charts, tables, equations, and code recognition. Experimental results demonstrate that SmolDocling competes with other Vision Language Models that are up to 27 times larger in size, while reducing computational requirements substantially. The model is currently available; datasets will be publicly available soon.