YOLOE: See Everything in Real Time

1. Tutorial Introduction
YOLOE is a real-time open-vocabulary vision model proposed by a research team from Tsinghua University in 2025, with the goal of "seeing anything in real time". It inherits the speed and efficiency of the YOLO family and, on top of that, deeply integrates zero-shot learning and multimodal prompting, supporting object detection and segmentation under text prompts, visual prompts, and a prompt-free mode. The associated paper is "YOLOE: Real-Time Seeing Anything".
Core Features
- Text prompts: any category can be specified in free text
- Multimodal prompts:
  - Visual prompts (boxes/points/hand-drawn regions/reference images)
- Prompt-free fully automatic detection: automatically identifies objects in the scene
Demo environment: YOLOv8e/YOLOv11e series + RTX 4090
2. Operation steps
1. After starting the container, click the API address to open the web interface
If "Bad Gateway" is displayed, the model is still initializing; wait about 1-2 minutes and refresh the page.
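Instead of refreshing by hand, you can poll the endpoint from a script. A minimal sketch, assuming the API address returns HTTP 502 (Bad Gateway) while the model is initializing and 200 once it is ready; the URL is a placeholder:

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(url, timeout_s=120, interval_s=5, fetch=None):
    """Poll `url` until it stops returning 502 (model initializing).

    Returns True once a 200 is seen, False if `timeout_s` elapses first.
    `fetch` is injectable for testing; by default it does an HTTP GET
    and returns the status code.
    """
    if fetch is None:
        def fetch(u):
            try:
                with urllib.request.urlopen(u, timeout=10) as resp:
                    return resp.status
            except urllib.error.HTTPError as e:
                return e.code  # e.g. 502 while the gateway is not ready

    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if fetch(url) == 200:
            return True
        time.sleep(interval_s)
    return False
```

Once `wait_until_ready("http://<api-address>/")` returns True, the web interface should load normally.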

2. YOLOE function demonstration
1. Text prompt detection
- Arbitrary text categories: any category can be named in free text
- Custom prompt words: users may enter arbitrary text (recognition results may vary with semantic complexity)
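Outside the web UI, the same text-prompt workflow can be scripted. A hedged sketch using the Ultralytics YOLOE API; the weight file name and the `set_classes`/`get_text_pe` calls are assumptions based on Ultralytics' published YOLOE interface, so check them against your installed version:

```python
def detect_with_text_prompt(image_path, class_names, weights="yoloe-11s-seg.pt"):
    """Run YOLOE restricted to the categories named in `class_names`."""
    # Deferred import so the sketch can be read without ultralytics installed.
    from ultralytics import YOLOE

    model = YOLOE(weights)
    # Bind the text prompt: the model only looks for these categories.
    model.set_classes(class_names, model.get_text_pe(class_names))
    return model.predict(image_path)


# Example prompt covering two arbitrary text categories.
PROMPT_CLASSES = ["person", "bicycle"]
```

As noted above, recognition quality may vary with how semantically complex the prompt text is.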


2. Multimodal visual cues
- 🟦 Box selection detection (bboxes)
bboxes: for example, if you upload an image containing many people and want to detect people, draw a bbox around one person; at inference time the model finds all people in the image based on the content of that box. Multiple bboxes can be drawn for a more precise visual prompt.
- ✏️ Click/draw area (masks)
masks: likewise, cover one person with a mask and the model will find all people in the image based on the masked content. Multiple masks can be drawn for a more precise visual prompt.
- 🖼️ Reference image comparison (Intra/Cross)
Intra: draw bboxes or masks on the current image and run inference on the same image.
Cross: draw bboxes or masks on the current image and run inference on other images.
Core Concepts

| Mode | Description | Application scenario |
|---|---|---|
| Intra-image | Models object relations within a single image | Precise localization of local targets |
| Cross-image | Cross-image feature matching | Retrieval of similar objects |
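Both modes can be driven from code as well. A hedged sketch of a box-prompted call via the Ultralytics API; the weight name, the `visual_prompts` dict layout, the `refer_image` parameter, and `YOLOEVPSegPredictor` are assumptions based on Ultralytics' published YOLOE interface:

```python
def detect_with_visual_prompt(image_path, box_xyxy, refer_image=None,
                              weights="yoloe-11s-seg.pt"):
    """Intra mode: prompt and inference on the same image (refer_image=None).
    Cross mode: the box is drawn on `refer_image`, inference runs on `image_path`.
    """
    # Deferred imports so the sketch can be read without the packages installed.
    import numpy as np
    from ultralytics import YOLOE
    from ultralytics.models.yolo.yoloe import YOLOEVPSegPredictor

    model = YOLOE(weights)
    # One bbox in xyxy pixel coordinates; `cls` groups prompts into classes,
    # so several boxes with the same id act as one richer prompt.
    prompts = dict(bboxes=np.array([box_xyxy], dtype=float),
                   cls=np.array([0]))
    return model.predict(image_path, refer_image=refer_image,
                         visual_prompts=prompts,
                         predictor=YOLOEVPSegPredictor)


# Example box framing one person (pixel coordinates are illustrative only).
EXAMPLE_BOX = [221.0, 405.0, 344.0, 857.0]
```

Passing several boxes (or masks) with the same class id mirrors the "draw multiple bboxes/masks" tip above.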



3. Fully automatic detection without prompting
- 🔍 Intelligent scene analysis: Automatically identify all salient objects in an image
- 🚀 Zero configuration startup: Works without any prompt input
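The zero-configuration mode also has a one-call API equivalent. A hedged sketch; the "-pf" (prompt-free) weight name is an assumption based on Ultralytics' published YOLOE model listing:

```python
def detect_prompt_free(image_path, weights="yoloe-11s-seg-pf.pt"):
    """Zero-configuration mode: no text or visual prompt is supplied."""
    # Deferred import so the sketch can be read without ultralytics installed.
    from ultralytics import YOLOE

    # The "-pf" (prompt-free) weights carry a built-in vocabulary,
    # so the model labels all salient objects on its own.
    model = YOLOE(weights)
    return model.predict(image_path)
```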


Exchange and discussion
🖌️ If you come across a high-quality project, please leave a message to recommend it! We have also set up a tutorial exchange group; scan the QR code and note [SD Tutorial] to join, discuss technical issues, and share results.
