Grass and Inference.net Launch ClipTagger-12b, a High-Accuracy Video Annotation Model Outperforming Claude 4 and GPT-4.1 at Up to 17x Lower Cost
Grass and Inference.net have announced the launch of ClipTagger-12b, a new video annotation model designed to identify actions, objects, and logos in video with high precision and detail. Built on one of the world’s largest real-world video datasets, collected by Grass, and trained on Inference.net’s scalable AI infrastructure, the model delivers state-of-the-art performance at a fraction of the cost of existing solutions.

The model is now available via API and is already being used across diverse applications, including autonomous vehicles, warehouse robotics, and content moderation systems. In benchmark tests, ClipTagger-12b outperforms both Claude 4 and GPT-4.1 on key annotation metrics such as ROUGE and BLEU, while costing up to 17 times less.

ClipTagger-12b was developed through a strategic collaboration between Grass and Inference.net. The model was trained on a curated subset of over 1 billion publicly available videos scraped from the web by Grass’s decentralized data collection network. The videos were processed, and the model was trained, on Inference.net’s distributed compute infrastructure, which supports high-performance AI workloads without dependence on centralized cloud providers.

“We believe that high-quality, low-cost AI models are possible when you combine the right data with smart engineering,” said Sam Hogan, CEO of Inference.net. “This model proves that advanced AI doesn’t have to come from just a few large labs.”

Andrej Radonjic, CEO of Wynd Labs, added, “The future of AI depends on maintaining an open web and building the infrastructure to turn that open data into training fuel. ClipTagger-12b is a step in that direction, democratizing access to powerful video understanding tools.”

The launch marks a significant shift in how specialized AI models are developed and deployed.
By leveraging decentralized data collection and distributed compute, Grass and Inference.net are enabling startups, researchers, and developers to access cutting-edge video annotation capabilities that were once exclusive to well-funded AI labs.

ClipTagger-12b is live on Inference.net, where users can integrate it into their workflows via API. The model’s weights and supporting materials are also available on Hugging Face for the research community. Researchers can apply for up to $10,000 in compute credits through the Inference.net Grants program at inference.net/grants.

Grass is a user-driven platform that allows anyone to contribute their unused internet bandwidth, helping power a global network for gathering real-world data to train AI systems.

Inference.net is a distributed AI compute network designed to run models at scale, offering developers a faster, more cost-effective alternative to traditional cloud infrastructure.