Vision-Language Model (VLM)
A vision-language model (VLM) is an artificial intelligence model that jointly understands and processes visual input (images or video) and text. It can perform complex tasks such as image captioning, visual question answering, and image-text retrieval, and is widely used in content analysis, intelligent assistants, robotics, and other fields.
A typical VLM architecture follows a three-stage information flow: a visual encoder (such as a ViT) converts the input image into abstract visual feature vectors; a projection layer (such as a linear layer or a Q-Former) aligns these visual features with the language model's semantic space; and a large language model receives the aligned features together with the text instruction to perform unified understanding, reasoning, and content generation.
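The three-stage flow above can be sketched in a few lines. This is a minimal illustrative toy, not a real model: the encoder, projection weights, embedding table, and all dimensions (`NUM_PATCHES`, `VISION_DIM`, `LM_DIM`) are made-up stand-ins chosen only to show how visual tokens end up in the same embedding space as text tokens.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, for illustration only.
NUM_PATCHES = 16   # image patches produced by the visual encoder
VISION_DIM = 32    # visual feature dimension (e.g. a ViT hidden size)
LM_DIM = 48        # language model embedding dimension

def visual_encoder(image: np.ndarray) -> np.ndarray:
    """Stand-in for a ViT: maps an image to per-patch feature vectors."""
    # A real encoder applies patch embedding plus transformer layers;
    # a fixed random projection keeps this sketch self-contained.
    patches = image.reshape(NUM_PATCHES, -1)
    w = rng.standard_normal((patches.shape[1], VISION_DIM))
    return patches @ w  # shape (NUM_PATCHES, VISION_DIM)

# Projection layer: a single linear map aligning visual features with
# the language model's embedding space. (A Q-Former would instead use
# learned query tokens and cross-attention to produce fewer tokens.)
W_proj = rng.standard_normal((VISION_DIM, LM_DIM))

def project(features: np.ndarray) -> np.ndarray:
    return features @ W_proj  # shape (NUM_PATCHES, LM_DIM)

def embed_text(token_ids: list[int]) -> np.ndarray:
    """Stand-in for the LLM's token embedding table."""
    table = rng.standard_normal((1000, LM_DIM))
    return table[token_ids]

# Stage 1 -> 2 -> 3: encode, project, then hand the LLM one unified
# sequence of visual tokens followed by text-instruction tokens.
image = rng.standard_normal((8, 8, 3))          # toy "image"
visual_tokens = project(visual_encoder(image))  # (16, LM_DIM)
text_tokens = embed_text([5, 42, 7])            # e.g. "describe this image"
llm_input = np.concatenate([visual_tokens, text_tokens], axis=0)

print(llm_input.shape)  # (19, 48): 16 visual tokens + 3 text tokens
```

The key design point this sketch highlights is that after projection, visual features are just extra embedding vectors in the LLM's input sequence, so the language model can attend over image and text content uniformly.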