Scaling Feature Engineering with Feast and Ray for Production ML Pipelines
In a project to build propensity models that predict customer purchases over a 30-day horizon, I kept running into two feature engineering problems: inadequate feature management and high latency in feature processing. This article explains how to address both with Feast, an open-source feature store, and Ray, a distributed computing framework, to build scalable and efficient machine learning pipelines.

The use case is training and serving a 30-day customer purchase propensity model on the UCI Online Retail dataset, which contains transaction records from a UK-based online retailer between December 2010 and December 2011. The feature engineering scope is kept simple: Recency, Frequency, Monetary (RFM) metrics and customer behavioral features derived from a 90-day lookback window. Cutoff dates are spaced 30 days apart. For each cutoff date, a rolling window is used to compute features and labels, where a label of 1 indicates at least one purchase in the following 30 days and 0 means no purchase.

Feast serves as a centralized feature store that manages, stores, and serves machine learning features, acting as a single source of truth for both training and inference. It supports both offline (historical) and online feature retrieval; this article focuses on offline features for batch prediction. Feast integrates with a range of data backends and ML frameworks, so it can be used in both cloud and on-premise environments.

Ray is a distributed computing framework designed to scale machine learning workloads from a single machine to large clusters. We use Ray Core, which lets us run Python functions as distributed tasks. This is particularly useful for parallelizing feature engineering across multiple rolling windows.
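The rolling-window scheme described above can be sketched in plain Python. This is a minimal illustration and not the article's actual implementation: the names `make_cutoffs` and `build_rows`, the tuple layout of the transactions, the sample data, and the choice to treat the cutoff day itself as part of the label window are all assumptions made for the example. Customers with no purchase in the lookback window are simply omitted here.

```python
from datetime import date, timedelta

LOOKBACK_DAYS = 90  # feature window length, as described in the article
HORIZON_DAYS = 30   # label window length, as described in the article


def make_cutoffs(first_cutoff, n_windows, step_days=30):
    """Return n_windows cutoff dates spaced step_days apart (hypothetical helper)."""
    return [first_cutoff + timedelta(days=step_days * i) for i in range(n_windows)]


def build_rows(transactions, cutoff):
    """Compute RFM features and a purchase label per customer for one cutoff.

    transactions: iterable of (customer_id, purchase_date, amount) tuples.
    Features use purchases in [cutoff - 90 days, cutoff).
    The label uses purchases in [cutoff, cutoff + 30 days) (assumed boundary).
    """
    window_start = cutoff - timedelta(days=LOOKBACK_DAYS)
    horizon_end = cutoff + timedelta(days=HORIZON_DAYS)

    rows = {}
    future_buyers = set()
    for customer_id, day, amount in transactions:
        if window_start <= day < cutoff:
            if customer_id not in rows:
                rows[customer_id] = {
                    "customer_id": customer_id,
                    "event_timestamp": cutoff,
                    "recency_days": LOOKBACK_DAYS,
                    "frequency": 0,
                    "monetary": 0.0,
                }
            row = rows[customer_id]
            # Recency: days between the cutoff and the latest purchase before it.
            row["recency_days"] = min(row["recency_days"], (cutoff - day).days)
            # Frequency: number of transaction rows in the window (assumed definition).
            row["frequency"] += 1
            # Monetary: total spend in the window.
            row["monetary"] += amount
        elif cutoff <= day < horizon_end:
            future_buyers.add(customer_id)

    for customer_id, row in rows.items():
        row["label"] = int(customer_id in future_buyers)
    return list(rows.values())


# Made-up sample data, for illustration only.
transactions = [
    ("c1", date(2011, 1, 10), 20.0),
    ("c1", date(2011, 2, 20), 35.0),
    ("c1", date(2011, 3, 15), 10.0),
    ("c2", date(2011, 2, 1), 50.0),
]
cutoffs = make_cutoffs(date(2011, 3, 1), n_windows=2)
rows = build_rows(transactions, cutoffs[0])
```

Because `build_rows` depends only on the transactions and a single cutoff date, each cutoff can be processed independently of the others, which is the property that makes this kind of function a natural candidate for a distributed task.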
To implement the pipeline, we first set up the environment by installing the required dependencies: Feast with Ray support, PostgreSQL via Docker, and ML libraries such as scikit-learn and XGBoost. The dataset is then cleaned and prepared for feature engineering.

Next, we define a function that computes features for a single cutoff date and apply the @ray.remote decorator to turn it into a distributed task. Ray executes these tasks in parallel across the available cores, which reduces processing time for the nine rolling windows compared with running them one after another.

We then configure the Feast feature registry. This includes defining the entity (customer_id), the data sources (where feature data resides), and the feature views (logical groupings of features). The timestamp_field is critical for ensuring point-in-time correctness during feature retrieval, because it tells Feast which moment each feature row describes. The feature registry is configured using a YAML file that specifies PostgreSQL as the metadata store and Ray as the offline store. Running feast apply registers all definitions and provisions the necessary infrastructure.

For model training, we create an entity spine (a DataFrame with customer IDs and event timestamps) and use Feast to retrieve the corresponding features. The same process applies to inference: predictions are generated from features retrieved through the same feature store, which keeps training and serving consistent.

By combining Feast and Ray, we achieve scalable, reusable, and efficient feature engineering. Feast provides consistent feature definitions and reliable access, while Ray accelerates computation through parallelization. Together, they address the two problems named at the outset, feature management and processing latency, and help teams build robust, maintainable production ML pipelines.
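To make the point-in-time retrieval step described above more concrete, here is a small plain-Python sketch of the underlying idea: for each row of the entity spine, take the most recent feature row whose timestamp is not later than the spine's event timestamp. This illustrates the semantics only and is not Feast's implementation; the function name `point_in_time_join`, the tuple layouts, and the sample data are assumptions, and rules that a real feature store adds (such as a time-to-live limit on feature rows) are ignored.

```python
from bisect import bisect_right
from collections import defaultdict
from datetime import date


def point_in_time_join(spine, feature_rows):
    """Attach to each spine row the latest feature row at or before its timestamp.

    spine: list of (customer_id, event_timestamp) tuples.
    feature_rows: list of (customer_id, feature_timestamp, features_dict) tuples.
    Returns a list of (customer_id, event_timestamp, features_dict or None).
    """
    # Group feature rows by customer and sort each group by timestamp.
    history = defaultdict(list)
    for customer_id, ts, features in feature_rows:
        history[customer_id].append((ts, features))
    for rows in history.values():
        rows.sort(key=lambda item: item[0])

    joined = []
    for customer_id, event_ts in spine:
        rows = history.get(customer_id, [])
        timestamps = [ts for ts, _ in rows]
        # Index of the first feature timestamp strictly after event_ts.
        idx = bisect_right(timestamps, event_ts)
        # Rows after event_ts are never used, which prevents leakage from the future.
        features = rows[idx - 1][1] if idx > 0 else None
        joined.append((customer_id, event_ts, features))
    return joined


# Made-up sample data, for illustration only.
feature_rows = [
    ("c1", date(2011, 3, 1), {"frequency": 2}),
    ("c1", date(2011, 3, 31), {"frequency": 5}),
]
spine = [
    ("c1", date(2011, 3, 15)),  # sees only the 2011-03-01 feature row
    ("c1", date(2011, 4, 1)),   # sees the 2011-03-31 feature row
    ("c1", date(2011, 2, 1)),   # no feature row exists yet
]
joined = point_in_time_join(spine, feature_rows)
```

The spine row dated 2011-03-15 receives the features computed at the 2011-03-01 cutoff, not the later ones, even though both exist in the store. This is why each feature row needs a trustworthy timestamp: without it, a training example could silently pick up values computed after the moment it is supposed to represent.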
