
WindVE: Optimizing Vector Embedding by Enhancing CPU-NPU Concurrent Processing Capabilities

Retrieval-Augmented Generation (RAG) enhances the performance of large language models by integrating information retrieval techniques. In industrial settings, inference services built on large language models are highly cost-sensitive, which drives the need to maximize hardware resource utilization. Vector embedding and retrieval alone can account for up to 20% of end-to-end latency, so optimizing the computational resources spent on vector embedding is crucial for improving the cost-effectiveness of inference and, in turn, product competitiveness. This paper analyzes the deployment costs of vector embedding in inference services and derives a theoretical formula showing that increasing the number of concurrent queries a system can handle is the key to reducing those costs; the focus is therefore on raising concurrent query capacity without sacrificing performance. To this end, the authors design a queue manager that offloads peak queries from the CPU to other processors, using a linear regression model to determine the optimal queue depth, a parameter that significantly affects system efficiency. Building on this manager, they develop WindVE, a system that adopts a CPU-NPU heterogeneous architecture to offload peak concurrent queries, exploiting the performance differences between the two processors to absorb sudden traffic spikes. In experiments comparing WindVE with the state-of-the-art vector embedding framework FlagEmbedding, WindVE improves concurrency handling capability by up to 22.3% over a solution without offloading, demonstrating its effectiveness in improving the cost-efficiency and performance of inference services.
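To make the offloading idea more concrete, the sketch below shows one way such a queue manager could be structured: queries are admitted to the primary device until a queue-depth threshold is reached, and the overflow is routed to the secondary device, with the threshold predicted by a simple linear model fitted on profiling data. This is a minimal illustration under assumed interfaces; the class and function names (OffloadingQueueManager, fit_depth_model, the primary/secondary callables) are hypothetical and do not reflect WindVE's actual implementation or API.

```python
import threading

import numpy as np


class OffloadingQueueManager:
    """Toy sketch: keep at most `depth_threshold` in-flight queries on the
    primary (fast) device and route the overflow to the secondary device.
    Names and interfaces here are hypothetical, not WindVE's actual API."""

    def __init__(self, primary_embed, secondary_embed, depth_threshold):
        self.primary_embed = primary_embed      # embedding call on the primary processor
        self.secondary_embed = secondary_embed  # embedding call on the overflow processor
        self.depth_threshold = depth_threshold
        self._in_flight = 0
        self._lock = threading.Lock()

    def dispatch(self, query):
        # Admit the query to the primary device only while its queue has room;
        # otherwise offload it so a traffic spike does not saturate the fast path.
        with self._lock:
            use_primary = self._in_flight < self.depth_threshold
            if use_primary:
                self._in_flight += 1
        if not use_primary:
            return self.secondary_embed(query)
        try:
            return self.primary_embed(query)
        finally:
            with self._lock:
                self._in_flight -= 1


def fit_depth_model(load_samples, best_depths):
    """Fit depth ~ a * load + b on offline profiling data and return a
    predictor; a stand-in for the linear-regression step described above."""
    a, b = np.polyfit(load_samples, best_depths, deg=1)
    return lambda load: max(1, int(round(a * load + b)))


if __name__ == "__main__":
    # Hypothetical profiling data: heavier concurrent load -> deeper primary queue.
    depth_for = fit_depth_model([50, 100, 200, 400], [4, 8, 15, 30])
    manager = OffloadingQueueManager(
        primary_embed=lambda q: f"primary:{q}",
        secondary_embed=lambda q: f"secondary:{q}",
        depth_threshold=depth_for(150),
    )
    print([manager.dispatch(f"q{i}") for i in range(5)])
```

In a serving scenario, `dispatch` would be invoked concurrently by many request threads, which is why the in-flight counter is guarded by a lock; the key design choice is that the queue-depth threshold, not the raw request rate, decides when offloading kicks in.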
