Apache Spark RAPIDS Qualification Tool Predicts GPU Performance for Big Data Workloads
The world of big data analytics continuously seeks ways to increase processing speed and reduce infrastructure costs. Apache Spark, a leading platform for scale-out analytics, is widely used to handle massive datasets across extract, transform, load (ETL), machine learning, and deep learning tasks. While Spark has traditionally run on CPUs, GPU acceleration promises significant performance improvements. However, migrating Spark workloads to GPUs is not a one-size-fits-all solution; factors such as dataset size, data movement, and the presence of user-defined functions (UDFs) can all affect performance.

The Spark RAPIDS Qualification Tool

To address the challenge of predicting GPU benefits for specific Spark workloads, NVIDIA introduced the Spark RAPIDS Qualification Tool. The tool analyzes CPU-based Spark applications and predicts their potential performance gains when migrated to a GPU cluster. It relies on a machine learning estimation model trained on industry benchmarks and real-world data to provide actionable insights.

How It Works

Input: The tool takes Spark event logs from CPU-based applications as its primary input. These logs record details about the application, its executors, and the operations performed, along with the relevant operating metrics.

Feature Extraction: The tool parses the event logs and generates CSV files with raw features for each SQL execution ID (sqlID). Features include disk bytes spilled, maximum heap memory usage, estimated scan bandwidth, and details about query operators.

Model Prediction: The extracted features are fed into a pretrained machine learning model, which predicts the speedup an application may achieve on GPUs. The model is trained on NDS benchmark workloads and can provide estimates at the individual operator level.

Output: The tool reports predictions for individual SQL operations as well as overall application performance, helping identify which parts of an application are most likely to benefit from GPU acceleration.

Running the Tool

The Spark RAPIDS Qualification Tool is run from the command line using the spark_rapids CLI. For example, you can start the qualification process with:

    spark_rapids qualification --eventlogs <file-path> --platform <platform>

The tool is compatible with both Spark 2.x and 3.x jobs and works across various environments, including AWS EMR, Google Dataproc, Databricks (on AWS and Azure), and on-premises setups.

Custom Qualification Model

While the pretrained models offer general guidance, they may not always be accurate for unique environments or specific workloads. To address this, the tool supports building custom qualification models:

Collect Event Logs: Run your Spark applications on both CPU and GPU clusters and gather the corresponding event logs. Ensure you have a CPU and GPU log pair for each workload.

Preprocess Logs: Use the Profiler tool to parse the raw event logs and generate CSV files containing raw features per sqlID. Setting the $QUALX_CACHE_DIR environment variable can speed up subsequent runs.

    qualx preprocess --dataset datasets

Train the Model: Train a custom XGBoost model on the extracted features and the observed speedups (a minimal sketch of the underlying idea follows this step). Training is started with:

    spark_rapids train --dataset datasets --model custom_onprem.json --output_folder train_output

Hyperparameter optimization is handled via Optuna. A minimum of 100 sqlIDs is recommended for an initial model, and roughly 1,000 sqlIDs for a more reliable one.
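Conceptually, the train command fits a gradient-boosted regression model that maps per-sqlID features to the speedups observed between the CPU and GPU runs. The short Python sketch below illustrates that idea only; the file name features.csv, the speedup column, and the hyperparameters are hypothetical placeholders and do not reflect the tool's actual schema or training code.

    # Illustrative sketch only: fit a speedup model on per-sqlID features.
    # The CSV name, column names, and hyperparameters are hypothetical and
    # do not correspond to the actual output of the preprocessing step.
    import pandas as pd
    import xgboost as xgb
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("features.csv")      # hypothetical per-sqlID feature table
    y = df.pop("speedup")                  # observed CPU-to-GPU speedup per sqlID
    X = df.select_dtypes("number")         # numeric features, e.g. bytes spilled

    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # Fit a gradient-boosted regressor; the real tool also tunes
    # hyperparameters with Optuna.
    model = xgb.XGBRegressor(n_estimators=200, max_depth=6, learning_rate=0.1)
    model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)

    print("Validation R^2:", model.score(X_val, y_val))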
Evaluate Model: Assess feature importance using metrics such as gain, cover, frequency, and Shapley (SHAP) values, and compare predicted speedups against actual speedups using metrics such as Mean Absolute Percentage Error (MAPE); a small illustrative sketch of these metrics appears at the end of this article.

Using the Custom Model

Once your custom model is trained and validated, you can use it to predict speedups for new Spark applications. Supply the path to your trained model file when running the prediction command:

    spark_rapids predict --eventlogs <file-path> --platform <platform> --custom_model_file custom_onprem.json

The output includes per-application and per-SQL speedup predictions, feature values, and importance metrics.

Additional Resources

NVIDIA's RAPIDS Accelerator for Apache Spark facilitates migration to GPUs with minimal code changes. It integrates the RAPIDS cuDF library with the Spark distributed computing framework to accelerate data processing tasks. Project Aether, a suite of tools and processes, also supports automatic qualification, testing, configuration, and optimization of Spark workloads for GPU acceleration at scale; interested organizations can apply for this free service.

Industry experts praise the Spark RAPIDS Qualification Tool for delivering accurate, actionable predictions, noting that it significantly reduces the guesswork involved in GPU migration and helps organizations make informed decisions about resource usage. NVIDIA continues to refine and expand the tool's capabilities to meet the growing demand for efficient big data processing.

For more details, refer to the Spark RAPIDS user guide or watch the GTC 2025 on-demand session.
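As a closing illustration of the evaluation step described above, the sketch below computes MAPE for a set of predicted versus measured speedups and notes how gain-based importance can be read from an XGBoost model. The values and variable names are made up for illustration and are not output produced by the tool.

    # Illustrative sketch only: MAPE between measured and predicted speedups.
    # The example values are made up and are not produced by the tool.
    import numpy as np

    def mape(actual, predicted):
        """Mean Absolute Percentage Error, in percent."""
        actual = np.asarray(actual, dtype=float)
        predicted = np.asarray(predicted, dtype=float)
        return float(np.mean(np.abs((actual - predicted) / actual)) * 100)

    actual_speedups = [2.1, 3.4, 1.2, 4.0]       # measured on a GPU cluster
    predicted_speedups = [1.9, 3.0, 1.5, 3.6]    # model predictions
    print(f"MAPE: {mape(actual_speedups, predicted_speedups):.1f}%")

    # Gain-based feature importance can be read from a trained XGBoost model:
    #   model.get_booster().get_score(importance_type="gain")
    # SHAP values can be computed with shap.TreeExplainer(model) from the
    # shap package.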