TII unveils Falcon Perception, a new AI perception system
The Technology Innovation Institute (TII) in Abu Dhabi has unveiled Falcon Perception and Falcon OCR, two new open-vocabulary AI models that challenge traditional perception system architectures. Unlike modular pipelines that combine separate vision backbones with language decoders, these models use a single early-fusion Transformer backbone that processes image patches and text tokens simultaneously.

Falcon Perception is a 0.6-billion-parameter model designed for grounding and segmentation via natural language prompts. It employs a hybrid attention mask that lets the same network act as a bidirectional visual encoder over image patches while predicting task tokens autoregressively. The model generates outputs through a structured Chain-of-Perception interface, predicting coordinates, size, and segmentation masks in a fixed sequence. Because output length is variable, the system can handle scenes containing anywhere from zero to hundreds of instances without running out of query tokens.

On the SA-Co benchmark, Falcon Perception achieved a Macro-F1 score of 68.0, outperforming SAM 3, which scored 62.3. The model showed significant gains in attribute-heavy, food, and sports-equipment categories, though it currently lags behind SAM 3 in presence-calibration accuracy.

To better evaluate performance on complex tasks, the team introduced PBench, a diagnostic benchmark that grades capabilities from simple object recognition up to dense, crowded-scene analysis. Results indicate that Falcon Perception excels when compositional prompts require Optical Character Recognition (OCR) guidance, spatial understanding, or relational reasoning, significantly outperforming generalist vision-language models despite its smaller parameter count. In OCR-guided tasks, for example, Falcon Perception correctly identified objects labeled with specific text where other models failed, and in dense scenes it successfully segmented hundreds of instances.
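The hybrid attention mask described above can be sketched roughly as follows. TII has not published the exact token layout here, so the ordering (image patches first, then task tokens) and the rule that patches do not attend to task tokens are illustrative assumptions:

```python
import numpy as np

def hybrid_attention_mask(n_image: int, n_text: int) -> np.ndarray:
    """Boolean mask (True = attention allowed) over a sequence of
    n_image image-patch tokens followed by n_text task tokens.

    Assumed layout: image patches attend bidirectionally to each other,
    while task tokens attend causally to all patches and to the task
    tokens that precede them.
    """
    n = n_image + n_text
    mask = np.zeros((n, n), dtype=bool)
    # Image patches: full bidirectional attention among themselves.
    mask[:n_image, :n_image] = True
    # Task tokens: causal attention over patches plus earlier task tokens.
    for i in range(n_image, n):
        mask[i, : i + 1] = True
    return mask
```

A single mask like this is what lets one backbone behave as a bidirectional encoder for the image region of the sequence and as an autoregressive decoder for the task-token region.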
The second release, Falcon OCR, is a 0.3-billion-parameter model tailored for document understanding. It uses the same early-fusion architecture but is trained from scratch, optimized for fine-grained glyph recognition rather than object segmentation. Falcon OCR scored 80.3% on the olmOCR benchmark and 88.6% on OmniDocBench, outperforming much larger proprietary systems on multi-column and table-extraction tasks. Its compact size also allows high serving throughput: up to 5,825 tokens per second on a single A100 GPU.

Both models are trained on massive datasets comprising millions of images and billions of expressions, using a multi-stage regimen that includes multi-teacher distillation, hard-negative mining, and a staged curriculum. The TII team argues that these results demonstrate the viability of a unified single-stack Transformer approach, suggesting that future improvements will come from better data and training signals rather than from increasingly complex modular pipelines. Both models are now available as open-source software, with Docker and MLX integration for deployment across a range of hardware.
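Of the training signals mentioned, hard-negative mining is commonly implemented by ranking non-matching candidates by model similarity and keeping the highest-scoring ones. A minimal sketch, where the function name, array shapes, and selection rule are assumptions for illustration rather than Falcon's published recipe:

```python
import numpy as np

def mine_hard_negatives(sim: np.ndarray, positives: np.ndarray, k: int) -> np.ndarray:
    """For each query (row of sim), pick the k non-positive candidates
    with the highest similarity scores, i.e. the 'hardest' negatives.

    sim:       (Q, C) query-candidate similarity matrix
    positives: (Q, C) boolean mask marking true matches
    Returns a (Q, k) array of candidate indices.
    """
    # Exclude positives by pushing their scores to -inf.
    masked = np.where(positives, -np.inf, sim)
    # Sort descending per row and keep the top-k indices.
    return np.argsort(-masked, axis=1)[:, :k]
```

Training on negatives the model currently confuses with positives, rather than random ones, is what makes this a stronger supervision signal for open-vocabulary matching.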
