Date

8 months ago

Size

776.37 MB

Project Overview

The Describe Anything Model (DAM) is an image and video captioning model jointly developed by teams from NVIDIA, UC Berkeley, and UCSF and released in 2025. It generates detailed descriptions of user-specified regions, which can be given as points, boxes, scribbles, or masks. For video content, annotating the region on any single frame is enough to obtain a complete description. The accompanying research paper is "Describe Anything: Detailed Localized Image and Video Captioning".
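To make the idea of a "user-specified region" concrete, the sketch below turns a box into a binary mask, the most explicit of the four region types listed above. It is an illustration only: the function name `box_to_mask`, the `(x0, y0, x1, y1)` box convention with exclusive right and bottom edges, and the nested-list mask are assumptions made for this example, not part of the DAM code or this tutorial's web interface.

```python
def box_to_mask(width, height, box):
    """Rasterize a box into a height x width mask of 0s and 1s.

    box is (x0, y0, x1, y1) in pixel coordinates; x1 and y1 are exclusive.
    This convention is an assumption made for the example.
    """
    x0, y0, x1, y1 = box
    # Clamp the box to the image bounds so out-of-range boxes do not fail.
    x0, x1 = max(0, x0), min(width, x1)
    y0, y1 = max(0, y0), min(height, y1)
    return [
        [1 if (x0 <= x < x1 and y0 <= y < y1) else 0 for x in range(width)]
        for y in range(height)
    ]


if __name__ == "__main__":
    # A 4x3 image with a 2x2 box starting at column 1, row 0.
    for row in box_to_mask(4, 3, (1, 0, 3, 2)):
        print(row)
```

Points and scribbles carry less information than a box or mask, so they would need an additional segmentation step before they could be used this way; that step is outside the scope of this sketch.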

This tutorial runs on a single RTX 4090 GPU.

Project Examples

Run steps

1. After starting the container, click the API address to open the web interface

If "Bad Gateway" is displayed, the model is still initializing. Because the model is large, wait about 1-2 minutes and then refresh the page.
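Instead of refreshing by hand, the wait can be scripted. The following is a minimal sketch using only the Python standard library; it assumes the web interface answers with HTTP 200 once the model has loaded and with an error status such as 502 (Bad Gateway) before that. The function names, the 12 attempts, and the 10-second delay are choices made for this example, and the URL is whatever API address your container shows.

```python
import time
import urllib.error
import urllib.request


def probe(url, timeout=5.0):
    """Return the HTTP status code for url, or None if no response arrives."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # Error statuses (for example 502 while the model loads) land here.
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return None


def wait_until_ready(url, fetch=probe, attempts=12, delay_s=10.0, sleep=time.sleep):
    """Poll url until fetch returns 200.

    Returns True once the page is ready, False if all attempts are used up.
    The defaults wait roughly two minutes in total.
    """
    for attempt in range(attempts):
        if fetch(url) == 200:
            return True
        if attempt < attempts - 1:
            sleep(delay_s)
    return False
```

Calling `wait_until_ready("<your API address>")` blocks until the page responds or the attempts run out. The `fetch` and `sleep` parameters exist only so the loop can be exercised without a network connection.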

2. Once the web page has loaded, you can interact with the model

Images should not exceed 5 MB, and videos should not exceed 20 seconds in length or 5 MB in size; larger inputs may cause the model to run slowly or report an error. Choose the region to be described carefully.
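The limits above can be checked before uploading. This is a rough sketch, not part of the tutorial's interface: the function names are hypothetical, 5 MB is assumed to mean 5 × 1024 × 1024 bytes, and the video duration is passed in by the caller because reading it from a file would need a video library.

```python
import os

MAX_BYTES = 5 * 1024 * 1024  # 5 MB, assuming binary megabytes
MAX_VIDEO_SECONDS = 20.0


def check_limits(kind, size_bytes, duration_s=None):
    """Return a list of limit violations; an empty list means the input fits."""
    if kind not in ("image", "video"):
        raise ValueError("kind must be 'image' or 'video'")
    problems = []
    if size_bytes > MAX_BYTES:
        problems.append(f"{kind} is {size_bytes / 2**20:.2f} MB; the limit is 5 MB")
    if kind == "video":
        if duration_s is None:
            problems.append("video duration is unknown; the limit is 20 seconds")
        elif duration_s > MAX_VIDEO_SECONDS:
            problems.append(f"video is {duration_s:.1f} s long; the limit is 20 seconds")
    return problems


def check_file(path, kind, duration_s=None):
    """Same check, reading the size from a file on disk."""
    return check_limits(kind, os.path.getsize(path), duration_s)
```

For example, `check_limits("video", 6 * 2**20, 25.0)` reports both a size and a duration problem, while `check_limits("image", 1024)` returns an empty list.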

This tutorial provides two modules to try: image mode and video mode.

The functions of each module are as follows:

Image Mode

Video Mode

Exchange and discussion

🖌️ If you come across a high-quality project, please leave us a message to recommend it. We have also set up a tutorial discussion group: scan the QR code and add the note [SD Tutorial] to join, discuss technical issues, and share your results.

Citation Information

Thanks to GitHub user zhangjunchang for deploying this tutorial. The project's citation information is as follows:

@article{lian2025describe,
  title={Describe Anything: Detailed Localized Image and Video Captioning}, 
  author={Long Lian and Yifan Ding and Yunhao Ge and Sifei Liu and Hanzi Mao and Boyi Li and Marco Pavone and Ming-Yu Liu and Trevor Darrell and Adam Yala and Yin Cui},
  journal={arXiv preprint arXiv:2504.16072},
  year={2025}
}

This notebook is contributed by community users and is intended for educational and informational purposes only. If any content involves copyright infringement, please contact us at [email protected] for prompt review and removal.

Related Notebooks

MonkeyOCR: Document Parsing Based on the structure-recognition-relation Triple Paradigm

3 months ago

PaddleOCR-VL: Multimodal Document Parsing

3 months ago

ROCKET-2: 3D Game Zero-Shot Transfer

2 months ago

Supertonic: A high-speed TTS Speech Synthesis Model Based on ONNX

2 months ago



