Volga's On-Demand Compute Layer Benchmarked: Low-Latency Feature Serving and Scalability on EKS
Real-time machine learning systems demand not only efficient models but also robust infrastructure capable of serving features at low latency under dynamic load. This article benchmarks Volga's On-Demand Compute Layer, a key component designed for real-time feature computation, deployed on Amazon Elastic Kubernetes Service (EKS) and orchestrated with Ray. The evaluation covers request latency, storage read latency, and horizontal scalability across a range of load profiles, highlighting how architectural choices affect robustness and efficiency.

Background

Volga is a data processing system tailored for modern AI/ML pipelines, with an emphasis on real-time feature engineering. It comprises two core components: the Streaming Engine and the On-Demand Compute Layer. The On-Demand Compute Layer is a stateless service built on Ray Actors, with each worker running a Starlette server that executes user-defined logic on data produced by the Streaming Engine or serves precomputed data. The deployment is fronted by an AWS Application Load Balancer, which routes requests to the EKS nodes hosting Volga pods; each pod then distributes requests among the workers running on that node.

Test Setup

The load tests were conducted on an Amazon EKS cluster of t2.medium instances (2 vCPUs, 4 GB RAM), hosting both the Locust deployment used for load generation and the Ray cluster running Volga. Each Ray pod was mapped to a single EKS node to ensure resource isolation. The On-Demand Compute Layer sat behind the AWS Application Load Balancer, which served as the primary target for the Locust workers. For detailed setup instructions, refer to the volga-ops repository.

Resource Sizing

Because Volga's workers are lightweight, single-threaded Python processes, resource estimation follows a simple 1 worker = 1 CPU mapping. This keeps computational resources fully utilized without overloading individual nodes.

Storage Configuration

Redis was chosen as the intermediate storage layer between the Streaming Engine and the on-demand layer for its simplicity and high performance. In production environments, however, storage systems such as ScyllaDB, Cassandra, or DynamoDB are recommended for their stronger consistency guarantees and better durability.

Benchmarks and Results

Features to Calculate/Serve

The benchmarks emulated a simple real-time feature pipeline with two components:

1. Streaming Pipeline Feature: a source function test_feature that produces mock data.
2. Request-Time Transformation: an on-demand function simple_feature that multiplies the value from test_feature by a given multiplier.

Metrics Measured

During each load test, the following metrics were monitored:

- End-to-End Request Latency: time taken for a complete request cycle.
- Storage Read Latency: time taken to fetch data from Redis.
- CPU Utilization: tracked via AWS Container Insights to confirm nodes were fully used.

The tests followed a stepwise load-increase pattern in which Requests Per Second (RPS) grew incrementally every 20 seconds over a 3-minute period, up to a configured maximum. Locust's internal backpressure mechanism halted RPS growth if latency exceeded acceptable thresholds.
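For reference, this stepwise pattern maps naturally onto Locust's LoadTestShape. The sketch below is illustrative rather than the benchmark's actual Locust file: the endpoint path, step size, and per-user wait time are assumptions, and Locust steers concurrent users (a proxy for RPS) rather than RPS directly.

```python
# Illustrative only: endpoint path, payload, and step sizes are assumptions,
# not taken from the Volga benchmark code.
from locust import HttpUser, LoadTestShape, constant, task


class OnDemandUser(HttpUser):
    """Simulated client hitting the on-demand layer through the ALB."""
    wait_time = constant(1)  # each user issues roughly one request per second

    @task
    def fetch_feature(self):
        # Hypothetical request path; the real one is defined by the Volga deployment.
        self.client.get("/fetch_features?feature=simple_feature&key=user_1")


class StepLoadShape(LoadTestShape):
    """Grow load in steps every 20 seconds over a 3-minute run."""
    step_duration = 20    # seconds per step
    users_per_step = 100  # assumed increment; tune per worker count
    run_duration = 180    # 3 minutes total

    def tick(self):
        run_time = self.get_run_time()
        if run_time > self.run_duration:
            return None  # stop the test
        step = run_time // self.step_duration + 1
        users = int(step * self.users_per_step)
        return users, self.users_per_step  # (target user count, spawn rate)
```

Pointing Locust at the load balancer hostname (for example, locust -f locustfile.py --host http://<alb-dns-name>) reproduces this kind of ramp-up.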
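To make the serving path concrete, here is a minimal sketch of roughly what a single worker does per request: read the precomputed test_feature value from storage and apply the simple_feature multiplication. This is not Volga's actual worker code; the endpoint path, Redis key layout, and query parameters are assumptions chosen for illustration.

```python
# Minimal sketch of an on-demand worker's request path, assuming a Redis key
# layout of "test_feature:<key>" written by the streaming pipeline.
import redis.asyncio as redis
from starlette.applications import Starlette
from starlette.responses import JSONResponse
from starlette.routing import Route

# Intermediate storage populated by the Streaming Engine (hostname assumed).
store = redis.Redis(host="redis", port=6379, decode_responses=True)


async def simple_feature(request):
    key = request.query_params.get("key", "user_1")
    multiplier = float(request.query_params.get("multiplier", "2.0"))

    # The storage read latency measured in the benchmark corresponds to this call.
    raw = await store.get(f"test_feature:{key}")
    if raw is None:
        return JSONResponse({"error": "feature value not found"}, status_code=404)

    # Request-time transformation: multiply the streamed value by the multiplier.
    value = float(raw) * multiplier
    return JSONResponse({"key": key, "simple_feature": value})


app = Starlette(routes=[Route("/fetch_features", simple_feature)])
```

In the deployment described above, each such worker is driven by a Ray Actor and mapped to a single CPU, which is where the 1 worker = 1 CPU sizing comes from.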
Max RPS Test

In the largest configuration, with 80 workers:

- Maximum Achievable RPS: over 5,000 RPS with average latencies well below 100 milliseconds.
- End-to-End Latency: consistently low, even under heavy load.
- Storage Read Latency: below 5 milliseconds, demonstrating Redis's performance.

Horizontal Scalability

Tests were conducted with 4, 10, 20, 40, 60, and 80 workers, tracking sustainable RPS and the corresponding latency metrics. The results indicated:

- Linear Scalability: the system showed near-linear increases in RPS as the number of workers grew, maintaining low latencies throughout.
- Latency under Increasing Load: latency remained stable and did not spike significantly, even as RPS reached peak levels.
- Efficient Resource Utilization: as workers were added, CPU utilization stayed balanced across nodes, and no single node became a bottleneck.

Conclusion

Volga's On-Demand Compute Layer is a powerful tool for constructing real-time AI/ML feature engineering pipelines. It significantly reduces the need for custom glue code, ad-hoc data models, and manual API abstractions, streamlining development. The benchmarks show that the On-Demand Layer:

- Handles High RPS: sustains over 5,000 RPS with minimal latency.
- Scales Horizontally: demonstrates near-linear scalability across varying worker counts.
- Ensures Low Latency: maintains consistently low latencies, even under heavy load.

To further improve reliability and performance in production, pairing the layer with storage backends such as ScyllaDB, Cassandra, or DynamoDB is recommended, as they provide stronger consistency and better durability for real-world workloads.

Industry insiders have praised Volga's approach to simplifying real-time feature engineering, noting its potential to accelerate the deployment of AI/ML systems in industries such as finance, healthcare, and e-commerce. The project is open source and actively maintained, offering a flexible and scalable option for developers looking to streamline their real-time data processing workflows.

Company Profile

Volga is an open-source project developed by a team of experienced software engineers and data scientists. The platform is designed to address the challenges of real-time data processing and feature engineering, making it easier for developers to build and deploy robust AI/ML pipelines. By building on technologies such as Kubernetes and Ray, Volga aims to democratize access to advanced data processing capabilities. Contributions are welcomed and encouraged; for more information or to get involved, visit the Volga GitHub repository.