Google Unveils Agentic Vision in Gemini 3 Flash: AI Now Thinks, Acts, and Observes Images with Code Execution for Smarter Visual Reasoning
Google has introduced Agentic Vision in Gemini 3 Flash, a new capability that enhances image understanding by combining visual reasoning with code execution. Unlike traditional AI models that analyze an image in a single static pass, Agentic Vision treats vision as an active, iterative process. The model can investigate an image step by step, zooming in on fine details, manipulating the visuals, and grounding its answers in actual visual evidence.

The core of Agentic Vision is a Think, Act, Observe loop. First, the model analyzes the user's query and the initial image to create a multi-step plan. Then it generates and runs Python code to perform actions such as cropping, rotating, annotating, or analyzing image data (a sketch of what such generated code might look like appears at the end of this article). Finally, the modified image is added back into the model's context, letting it observe the results and refine its reasoning before delivering a response. This integration of code execution improves performance across most vision benchmarks by 5 to 10 percent, yielding more accurate and reliable results.

The feature is already being used by developers and companies to unlock new applications. One key use case is zooming and inspecting. For example, PlanCheckSolver.com, a platform that validates building plans, improved its accuracy by 5 percent by using Agentic Vision to iteratively crop and analyze high-resolution sections of blueprints. The model generates code to focus on specific areas, such as roof edges or structural components, then uses the resulting images to verify compliance with complex building codes.

Another application is image annotation. Instead of merely describing what it sees, Gemini 3 Flash can now draw bounding boxes and labels directly on images. In a demo, the model counted the digits on a hand by marking each finger with a box and a number, creating a visual scratchpad that ensures precision and reduces errors (see the annotation sketch below).

Agentic Vision also excels at visual math and data visualization. When presented with dense tables or complex data, the model extracts the raw information, runs deterministic Python code to process it, for example normalizing values, and generates accurate charts using libraries like Matplotlib (see the charting sketch below). This eliminates the hallucinations common in standard LLMs during multi-step calculations.

Looking ahead, Google plans to make these behaviors more implicit, so the model will automatically decide when to zoom, rotate, or compute without requiring explicit prompts. Additional tools, such as web search and reverse image search, are also under development to further ground the model's understanding, and the capability will be expanded to model sizes beyond Flash in the future.

Developers can access Agentic Vision today through the Gemini API in Google AI Studio and Vertex AI; a minimal API sketch appears below. It is also rolling out in the Gemini app, available when Thinking mode is selected. Users can explore the feature in the AI Studio Playground by enabling Code Execution under Tools, and detailed developer documentation is available on the Vertex AI developer site.
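To make the Act step of the loop concrete, the Python the model generates for a zoom-and-inspect pass might resemble the following sketch. It is illustrative only: the file name, crop coordinates, and the use of the Pillow library are assumptions, not output captured from an actual Agentic Vision session.

```python
from PIL import Image

# Illustrative only: the file name and coordinates are placeholders, not
# output from a real Agentic Vision run.
blueprint = Image.open("blueprint_page_3.png")

# Crop a high-resolution region of interest (left, upper, right, lower in
# pixels) so a fine detail such as a roof edge fills most of the frame.
roof_detail = blueprint.crop((1800, 400, 2600, 1000))

# Upscale the crop before it is handed back into the model's context for
# the Observe step.
roof_detail = roof_detail.resize((1600, 1200), Image.LANCZOS)
roof_detail.save("roof_detail_zoom.png")
```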
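The annotation behavior from the finger-counting demo could be expressed with code along these lines. The box coordinates below are hypothetical; in a real session the model would derive them from its own localization of each finger.

```python
from PIL import Image, ImageDraw

# Hypothetical fingertip regions; a real session would compute these from
# the model's own detections rather than hard-code them.
finger_boxes = [
    (120, 80, 200, 220),
    (220, 60, 300, 210),
    (320, 70, 400, 225),
    (420, 90, 500, 240),
    (520, 140, 600, 300),
]

hand = Image.open("hand.png").convert("RGB")
draw = ImageDraw.Draw(hand)

# Draw one numbered box per finger, turning the image into a visual
# scratchpad the model can re-read before stating its final count.
for i, box in enumerate(finger_boxes, start=1):
    draw.rectangle(box, outline="red", width=4)
    draw.text((box[0], box[1] - 18), str(i), fill="red")

hand.save("hand_annotated.png")
print(f"Counted {len(finger_boxes)} digits")
```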
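For the visual math and charting use case, the deterministic post-processing step might look like the sketch below: values read from a dense table are normalized and plotted with Matplotlib. The numbers here are placeholders standing in for whatever the model extracts from the source image.

```python
import matplotlib.pyplot as plt

# Placeholder values standing in for figures the model reads out of a table.
regions = ["North", "South", "East", "West"]
revenue = [42.0, 58.5, 31.2, 73.8]

# Deterministic post-processing: normalize each value against the maximum so
# the chart shows relative share rather than raw magnitude.
max_value = max(revenue)
normalized = [value / max_value for value in revenue]

fig, ax = plt.subplots()
ax.bar(regions, normalized)
ax.set_ylabel("Revenue (share of maximum)")
ax.set_title("Normalized revenue by region")
fig.savefig("revenue_chart.png")
```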
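Finally, a minimal sketch of calling the feature through the Gemini API, assuming the google-genai Python SDK with the Code Execution tool enabled. The model identifier, image file, and prompt are placeholders; check the Gemini API documentation for the exact Gemini 3 Flash model ID and current tool configuration.

```python
from google import genai
from google.genai import types

# Assumes the google-genai Python SDK; the client reads the API key from the
# environment (GEMINI_API_KEY).
client = genai.Client()

# Placeholder: confirm the exact Gemini 3 Flash identifier in the model list.
MODEL_ID = "gemini-3-flash"

with open("blueprint_page_3.png", "rb") as f:
    image_bytes = f.read()

response = client.models.generate_content(
    model=MODEL_ID,
    contents=[
        types.Part.from_bytes(data=image_bytes, mime_type="image/png"),
        "Does the roof edge detail on this sheet meet the setback requirement?",
    ],
    # Enabling the code-execution tool is what lets the model run its own
    # Python (crop, annotate, compute) inside the Think, Act, Observe loop.
    config=types.GenerateContentConfig(
        tools=[types.Tool(code_execution=types.ToolCodeExecution())]
    ),
)

print(response.text)
```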
