One-click Deployment of Phi-3.5-vision-instruct

Model Introduction

Phi-3.5-vision-instruct is a multimodal model in Microsoft's Phi-3.5 series, designed for applications that process both text and visual input. It supports a context length of 128K tokens and has undergone a rigorous fine-tuning and optimization process, making it suitable for broad commercial and research use in environments with limited memory or compute and strict latency requirements. The model offers general image understanding, optical character recognition (OCR), chart and table parsing, and multi-image or video-clip summarization, making it well suited to a wide variety of AI-driven applications, and it shows significant performance gains on image- and video-related benchmarks. Architecturally, it is a 4.2-billion-parameter system that combines an image encoder, a connector, a projector, and the Phi-3 Mini language model. Training used 256 NVIDIA A100-80G GPUs over 6 days on 500 billion tokens of visual and text data.
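For readers who want to try the model outside the one-click container, the sketch below shows a standard Hugging Face transformers inference flow for it. The model ID and the <|image_1|> image-placeholder convention follow the public model card; the example image URL, prompt, and generation settings are illustrative assumptions, not values from this tutorial.

```python
# Minimal local-inference sketch for Phi-3.5-vision-instruct via transformers.
# The image URL, prompt, and max_new_tokens are placeholders, not tuned values.
from PIL import Image
import requests
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-3.5-vision-instruct"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
    _attn_implementation="eager",  # switch to "flash_attention_2" if flash-attn is installed
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True, num_crops=4)

# Load one example image; replace the URL with your own input.
image = Image.open(requests.get("https://example.com/chart.png", stream=True).raw)

# <|image_1|> marks where the first image is injected into the prompt.
messages = [
    {"role": "user", "content": "<|image_1|>\nSummarize the content of this chart."}
]
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(prompt, [image], return_tensors="pt").to("cuda")

output_ids = model.generate(
    **inputs,
    max_new_tokens=500,
    eos_token_id=processor.tokenizer.eos_token_id,
)
# Strip the prompt tokens before decoding the answer.
output_ids = output_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```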

The Phi-3.5-vision-instruct model scored 43.0 on the MMMU (Massive Multi-discipline Multimodal Understanding) benchmark, demonstrating its enhanced ability to handle complex image understanding tasks. In addition, the model was trained on high-quality educational data, synthetic data, and strictly screened public documents to ensure data quality and privacy.

This tutorial can be run on a single RTX 4090.

How to Run

1. After cloning and successfully starting the container, wait about 10 seconds, then hover over "API Address", copy the link, and open it in a new browser tab
2. You will then see the demo interface
3. Click to upload an image, select a model, enter your question, and click Submit (the demo can also be queried programmatically; see the sketch after this list)
4. The generated result is displayed
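Since the demo exposed at the API address is a Gradio app, it can also be called from Python once the container is running. The sketch below uses the gradio_client package; the URL, endpoint name, and argument order are hypothetical, so run client.view_api() first to see the actual signature of your deployment.

```python
# Hypothetical sketch for querying the deployed Gradio demo programmatically.
# The URL, endpoint name, and arguments are assumptions; verify them with
# client.view_api() before relying on this call.
from gradio_client import Client, handle_file

client = Client("https://<your-api-address>")  # paste the copied API Address here
client.view_api()  # prints the deployment's actual endpoints and parameters

# Hypothetical call: adjust api_name and arguments to match view_api() output.
result = client.predict(
    handle_file("question.png"),  # the image you would otherwise upload
    "Describe this image.",       # the question typed into the text box
    api_name="/predict",
)
print(result)
```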

Exchange and Discussion

🖌️ If you come across a high-quality project, please leave us a message to recommend it! We have also set up a tutorial exchange group. Friends are welcome to scan the QR code and add the note [SD Tutorial] to join the group, discuss technical issues, and share results↓