HyperAI超神経

Facebook's MILS: In-House AI Model Parses Docs, Replacing Gemini, OpenAI

5 days ago

In this article, the author provides a detailed guide to deploying an open-source vision-language model (VLM) for document parsing using AWS Batch. The solution pairs Alibaba's Qwen-2.5-VL model with the vLLM inference engine, offering a cost-effective and secure alternative to external LLM providers such as OpenAI's GPT or Google's Gemini.

Background

Most companies developing AI capabilities rely on external LLM providers, which can be expensive. As the open-source ecosystem matures, companies can instead build LLM functionality in-house, reducing costs and strengthening data privacy. This approach is particularly attractive to companies that have completed a proof of concept (POC) but find the cost of commercial models prohibitive at scale.

Solution Design and Implementation

Model Selection: Alibaba's Qwen-2.5-VL, built on a multimodal Transformer architecture, was chosen for its ability to generate structured outputs, which makes it well suited to document parsing. Lighter-weight alternatives such as SmolVLM and Idefics3 were considered but not selected.

vLLM Inference Acceleration: vLLM optimizes memory allocation through PagedAttention, allowing the model to serve multiple requests simultaneously. It supports a wide range of models, including VLMs, and can load any compatible model from the Hugging Face platform for GPU inference.

AWS Batch Deployment: The team used AWS Batch to manage and run batch jobs; it charges only for actual usage time, avoiding wasted resources. The deployment is defined in Terraform scripts that create IAM roles, configure compute environments, and define job queues. The workload runs on an EC2 instance with an Nvidia L4 GPU, 16 GB of RAM, and approximately 24 GB of VRAM.

Technical Implementation Steps

Document Downloading: Documents in image format are downloaded from S3 buckets, each identified by a unique S3 path.
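Since each document is keyed to a unique S3 path, the download step starts by splitting that path into the bucket and key that an S3 client such as boto3 expects. A minimal standard-library sketch (the bucket and key names here are hypothetical; the article does not publish its code):

```python
from urllib.parse import urlparse

def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split an s3://bucket/key/... URI into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")

# With boto3 (not shown here), the pair feeds straight into
# s3.download_file(bucket, key, local_path).
bucket, key = split_s3_uri("s3://example-docs/invoices/0001.png")
```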
Model Loading and Preparation: Qwen-2.5-VL is loaded from Hugging Face via vLLM, with parameters such as GPU resource utilization, maximum sequence length, and maximum number of concurrent requests configured up front. The model's output is steered into a JSON format through guided decoding.

Inference Execution: Images and instructions are sent to the model for batch inference. Guided decoding improves data quality.

Output Parsing and Validation: The generated JSON outputs are extracted and validated with Pydantic models to catch the model's "hallucinations" before they propagate.

Result Storage: Parsed data is associated with each original document's unique identifier, assembled into datasets, and saved back to S3 buckets.

Results and Evaluation

In testing, the average processing time was approximately 4.5 seconds per file. Parsing 10,000 documents cost around $10 and took about 12.5 hours. This approach is more economical and secure than relying on commercial LLM providers, and using open-source models leaves room for future customization and performance improvements.

Industry Opinion and Company Background

Industry experts praise this approach for its cost control, data privacy, and flexibility. Open-source releases such as Alibaba's are helping more companies benefit from advanced AI. Alibaba is a leading player in the AI field and a continuing contributor to the open-source ecosystem.

Facebook Research recently published a study demonstrating that large language models (LLMs) can handle image and audio tasks without task-specific training. The code for this research is open-sourced on GitHub, highlighting the versatility of LLMs.

Research Context and Methodology

Facebook Research's experiments show that certain LLMs can generate image descriptions, produce text for audio clips, and even create high-quality images by exploiting their generalization capabilities.
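The output parsing and validation step can be sketched with standard-library tools. The article uses Pydantic models for this; the dataclass and explicit type checks below play the same role, and the field names are hypothetical since the actual schema is not published:

```python
import json
from dataclasses import dataclass

@dataclass
class ParsedDoc:
    """Hypothetical target schema for one parsed document."""
    title: str
    date: str
    total: float

def validate_output(raw: str) -> ParsedDoc:
    """Parse the model's JSON output and reject hallucinated shapes:
    unexpected keys or wrongly typed fields raise instead of silently
    flowing into the dataset."""
    data = json.loads(raw)
    fields = {"title": str, "date": str, "total": (int, float)}
    if set(data) != set(fields):
        raise ValueError(f"unexpected keys: {sorted(data)}")
    for name, typ in fields.items():
        if not isinstance(data[name], typ):
            raise ValueError(f"field {name!r} has wrong type")
    return ParsedDoc(title=data["title"], date=data["date"], total=float(data["total"]))

doc = validate_output('{"title": "Invoice 7", "date": "2024-03-01", "total": 99.5}')
```

With guided decoding the model rarely violates the schema, but the validation layer is what turns "rarely" into a hard guarantee for downstream storage.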
Multiple datasets were used to validate these findings, including MS-COCO, Clotho, MSR-VTT, and ViClip-InternVid-10M-FLT.pth.

Running Environment and Steps

The project ships detailed installation instructions and suggests using a conda environment. Tasks include generating descriptions for images and audio clips, saving the outputs, and computing quality metrics; additional tasks cover high-resolution image generation and style transfer.

Community Feedback and Contributions

The researchers encourage community participation in testing and improvement. Issues can be reported via GitHub, and technical questions directed to Ashutosh Kumar. The project is released under the CC BY-NC 4.0 license, with some third-party content under separate licenses.

Industry Expert Evaluation

Experts view this research as a significant step forward, showcasing the multifunctional potential of LLMs and opening new avenues for cross-modal applications. The work continues Facebook Research's leadership in NLP and machine learning.

Creating personalized, dynamic prompts can greatly improve interactions with large language models (LLMs) by making them more contextual and efficient. Static prompts are consistent but often fail to meet user needs; dynamic prompts adapt to the user's context, improving both user experience and response quality.

Prompt Techniques

Contextual Construction: Detailed descriptions of the user's scenario give the model background information, yielding highly customized responses. Maintaining complex scenario descriptions, however, can be challenging.

Template-Based: Standardized templates are filled with content relevant to the situation. This method is easy to manage and scale but can lack flexibility.

Orchestration: Combines contextual construction with templates, dynamically adjusting and recombining prompt fragments for richer interactions.
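The orchestration technique just described — selecting contextual fragments dynamically and filling a standardized template with them — can be sketched as follows. The fragment library and template here are illustrative, not from any of the studies cited:

```python
# Illustrative fragment library: each key names a situation,
# each value is a reusable prompt fragment.
FRAGMENTS = {
    "expert": "Answer with precise technical detail.",
    "beginner": "Explain concepts in plain language with examples.",
    "multi_turn": "Take the earlier conversation into account.",
}

TEMPLATE = "{context}\n{style}\nUser question: {question}"

def orchestrate_prompt(question: str, user_level: str, history: list[str]) -> str:
    """Combine contextual construction (recent conversation) with a
    standardized template, selecting style fragments dynamically."""
    parts = [FRAGMENTS.get(user_level, FRAGMENTS["beginner"])]
    if history:
        parts.append(FRAGMENTS["multi_turn"])
    context = ("Context: " + " ".join(history[-2:])) if history else "Context: none"
    return TEMPLATE.format(context=context, style=" ".join(parts), question=question)

prompt = orchestrate_prompt("How does PagedAttention work?", "expert", ["We discussed vLLM."])
```

The point of the pattern is that the template stays fixed and auditable while the fragments vary per request, which is what lets orchestration scale like the template approach while adapting like contextual construction.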
Orchestration is currently regarded as best practice.

Research Findings

Studies show that orchestration performs best, adapting to varied situations and refining prompt content dynamically. The result is more accurate and useful AI responses, especially for complex queries, multi-turn dialogues, and cross-domain information requests. Dynamic prompts demand more computational resources and more complex design, but the benefits far outweigh the costs.

Industry Expert Evaluation

Dynamic prompt techniques are expected to become increasingly important in enterprise applications, helping companies stand out in competitive markets through personalized service. As LLMs continue to advance, companies such as AliCloud and Baidu will need to invest in algorithm optimization and data handling to stay ahead.

ChatGPT's "Projects" Feature

OpenAI has introduced a "projects" feature in ChatGPT, strengthening its organizational capabilities and turning it into a capable digital workspace. Users can create multiple projects, each with its own chat history, uploaded files, and custom instructions.

Usage and Benefits

Creating Projects: Organize tasks and collaborate efficiently.
Organizing Chat History: Classify and store relevant conversations.
Uploading Files: Attach and reference necessary documents.
Setting Instructions: Define per-project context and commands to improve the AI's understanding and efficiency.

Practical Applications

Code Development: Share code snippets, discuss issues, and record solutions.
Market Analysis: Upload market data, generate reports, and suggest strategies.
Education: Manage course materials and student interactions.
Content Creation: Draft, collect references, and refine content.

Industry Expert Evaluation

ChatGPT's "projects" feature marks a significant upgrade for AI assistants, enhancing both individual productivity and team collaboration, and underscores the growing role of AI in productivity tools.
Automating Everyday Tasks with Python

AI automation goes beyond simple button-clicking, offering solutions that can significantly boost efficiency and productivity. Five practical examples using Python:

Automatic Meeting Scheduling: Capture meeting requests, parse the details, automatically select suitable time slots, and send confirmations or suggestions to participants.
File Management Automation: Identify and archive different file types, and implement search functions for quick access.
Customer Support Chatbots: Automate common inquiries, optimize responses through deep learning, and escalate complex issues to human agents.
Automated Data Analysis: Clean and organize data, analyze trends, and surface actionable insights.
Smart Home Control Systems: Adjust home settings based on user habits, enable natural interaction, and integrate devices for seamless management.

Industry Expert Evaluation

These AI automation solutions offer tangible benefits, especially for small and medium-sized businesses: lower costs, higher operational efficiency, and improved competitiveness. As the technology advances, the scope of AI automation will expand, encouraging companies to explore applications suited to their needs.
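The file management example is the easiest of the five to make concrete. A minimal sketch with the standard library, assuming a hypothetical mapping of file extensions to archive folders:

```python
import shutil
from pathlib import Path

# Hypothetical mapping of extensions to archive subfolders.
ARCHIVE_RULES = {".pdf": "documents", ".png": "images", ".jpg": "images", ".csv": "data"}

def archive_files(inbox: Path, archive_root: Path) -> dict[str, int]:
    """Move files from an inbox into per-type archive folders and
    return a count of files moved into each folder."""
    moved: dict[str, int] = {}
    for path in inbox.iterdir():
        folder = ARCHIVE_RULES.get(path.suffix.lower())
        if folder is None or not path.is_file():
            continue  # leave unknown types and subdirectories alone
        dest = archive_root / folder
        dest.mkdir(parents=True, exist_ok=True)
        shutil.move(str(path), str(dest / path.name))
        moved[folder] = moved.get(folder, 0) + 1
    return moved
```

A production version would typically run on a schedule or watch the directory for changes, and index the archived paths to support the quick-search function mentioned above.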
