Accelerating SHAP Computations on Large Datasets Using Python's ProcessPoolExecutor


SHAP is popular because it provides a consistent, mathematically grounded way to attribute a model’s prediction back to its input features. Unlike simple feature importance scores, SHAP explains predictions row by row, telling you why the model predicted what it did for a given customer. This makes it far more intuitive when the goal is to understand customer behaviour and design interventions.

An intuitive way to think of it: imagine a group project where the final grade is the prediction. SHAP works out how much each teammate (feature) contributed to that grade. Sometimes a teammate carries most of the load, sometimes contributions are more balanced, but SHAP ensures every teammate’s role is fairly accounted for.

Important boundary: SHAP does not reveal causal relationships. It tells us what the model relied on, not what truly caused an outcome in reality.

Two Roles of the Model

In practice, a model can serve more than one purpose. It may not only generate predictions about future outcomes, but also provide a lens into which factors the model considers most influential.

This dual use turns the model into more than a predictor—it highlights potential north star metrics for business decisions. Insights that may feel obvious in hindsight become validated and quantified at scale through SHAP.

The Performance Challenge

SHAP, especially on tree-based models like XGBoost, can be computationally heavy when applied to large datasets. Running it naively on millions of rows can take hours. In my case, computing SHAP values on a large dataset initially took around 2 hours. By rethinking the execution strategy, I reduced this to 10 minutes.

Why is SHAP+XGBoost CPU-bound?

Computing SHAP values for a tree ensemble is pure number crunching: for every row, TreeExplainer walks each tree in the model to allocate credit across features. There is no I/O to wait on, so a single worker simply pegs one core. Python threads offer little relief here, since CPython's GIL serializes Python-level work and the workload saturates the CPU rather than blocking. Separate processes, each with its own interpreter, are what let the computation spread across all cores.

Parallelizing SHAP with Chunking

The key idea was simple: break the dataset into manageable chunks, process each chunk in parallel, then recombine the results.

import concurrent.futures
import pandas as pd
import shap

# Assume clf is a trained XGBoost model
explainer = shap.TreeExplainer(clf)
features = clf.get_booster().feature_names

# Split data into chunks
def split_dataframe(df, chunk_size):
    return [df.iloc[i:i+chunk_size] for i in range(0, len(df), chunk_size)]

# Function to compute SHAP for a chunk.
# Note: `explainer` and `features` are module-level globals, so worker
# processes inherit them (via fork on Linux) rather than rebuilding them.
def process_chunk(chunk):
    shap_values = explainer.shap_values(chunk[features])
    shap_df = pd.DataFrame(shap_values, columns=features, index=chunk.index)
    # Top 3 features by (signed) SHAP value for each row
    top_predictors = shap_df.apply(lambda x: x.nlargest(3).index.tolist(), axis=1)
    return top_predictors

# Parallel execution
def compute_shap_parallel(X, chunk_size=10000, num_workers=8):
    chunks = split_dataframe(X, chunk_size)
    with concurrent.futures.ProcessPoolExecutor(max_workers=num_workers) as executor:
        results = list(executor.map(process_chunk, chunks))
    return pd.concat(results)
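To see the chunk-and-recombine pattern end to end without a trained model, here is a self-contained sketch. The XGBoost explainer is replaced by a hypothetical stand-in that emits random attributions, and the chunks are mapped sequentially so the example runs anywhere; swapping the plain `map` for `executor.map` gives the parallel version above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for explainer.shap_values(): random "attributions"
# with the same shape as the input chunk. Feature names are made up.
rng = np.random.default_rng(0)
features = ["tenure", "spend", "visits"]

def fake_shap_values(chunk):
    return rng.normal(size=(len(chunk), len(features)))

def split_dataframe(df, chunk_size):
    return [df.iloc[i:i + chunk_size] for i in range(0, len(df), chunk_size)]

def process_chunk(chunk):
    shap_df = pd.DataFrame(fake_shap_values(chunk), columns=features, index=chunk.index)
    # Top 3 features by attribution for each row
    return shap_df.apply(lambda x: x.nlargest(3).index.tolist(), axis=1)

X = pd.DataFrame(rng.normal(size=(25, 3)), columns=features)
chunks = split_dataframe(X, chunk_size=10)      # 10 + 10 + 5 rows
result = pd.concat(map(process_chunk, chunks))  # sequential stand-in for executor.map

assert len(result) == len(X)
assert result.index.equals(X.index)  # recombined output lines up with the input rows
```

The key property to verify after recombination is that the concatenated output preserves the original row index, so SHAP results can be joined back onto the source data.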

Why Chunk Size Matters

Choosing the right chunk size is critical:

  - Too small, and the overhead of pickling chunks and shuttling results between processes starts to dominate the runtime.
  - Too large, and each worker holds a big slab of data plus its SHAP matrix in memory, while a straggler chunk can leave other cores idle.

The sweet spot balances CPU utilization with memory efficiency. In practice, experimenting with different chunk sizes on the target infrastructure is the most reliable way to find it.
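Before experimenting, a rough back-of-envelope check can narrow the search: estimate how much memory a chunk, plus its SHAP matrix of the same shape, will occupy per worker. The row, feature, and worker counts below are illustrative assumptions, not measured values.

```python
def chunk_memory_mb(n_rows, n_features, bytes_per_value=8):
    """Rough per-worker footprint: the input chunk plus its SHAP matrix,
    each n_rows x n_features of float64 values."""
    return 2 * n_rows * n_features * bytes_per_value / 1e6

# Example: 10,000-row chunks with 200 features, 8 workers in flight
per_chunk = chunk_memory_mb(10_000, 200)
print(f"~{per_chunk:.0f} MB per chunk, ~{8 * per_chunk:.0f} MB across 8 workers")
```

If the total across all in-flight workers approaches available RAM, shrink the chunk size before tuning anything else.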

Lessons Learned

  1. Don’t rebuild explainers unnecessarily. Construct your SHAP explainer once and reuse it across chunks.
  2. Use processes, not threads. SHAP + XGBoost is CPU-bound, so ProcessPoolExecutor is the right tool.
  3. Tune chunk size. It can dramatically impact runtime.
  4. Interpret carefully. SHAP explains model attributions, not causal reality.

Closing Thoughts

Model interpretability often comes at a computational cost. By applying parallelization with careful chunking, SHAP explanations that once took hours can be produced in minutes. More importantly, SHAP helps bridge the gap between prediction and insight—turning a machine learning model into both a forecasting tool and a guide for strategic decision-making.