Accelerating SHAP Computations on Large Datasets Using Python's ProcessPoolExecutor


SHAP is popular because it provides a consistent, mathematically grounded way to attribute a model’s prediction back to its input features. Unlike simple feature importance scores, SHAP explains predictions row by row, telling you why the model predicted what it did for a given customer. This makes it far more intuitive when the goal is to understand customer behaviour and design interventions.

An intuitive way to think of it: imagine a group project where the final grade is the prediction. SHAP works out how much each teammate (feature) contributed to that grade. Sometimes a teammate carries most of the load, sometimes contributions are more balanced, but SHAP ensures every teammate’s role is fairly accounted for.

Important boundary: SHAP does not reveal causal relationships. It tells us what the model relied on, not what truly caused an outcome in reality.

Two Roles of the Model

In practice, a model can serve more than one purpose. It may not only generate predictions about future outcomes, but also provide a lens into which factors the model considers most influential.

This dual use turns the model into more than a predictor—it highlights potential north star metrics for business decisions. Insights that may feel obvious in hindsight become validated and quantified at scale through SHAP.

The Performance Challenge

SHAP, especially on tree-based models like XGBoost, can be computationally heavy when applied to large datasets. Running it naively on millions of rows can take hours. In my case, computing SHAP values on a large dataset initially took around 2 hours. By rethinking the execution strategy, I reduced this to 10 minutes.

Why is SHAP+XGBoost CPU-bound?

Computing SHAP values for a tree ensemble is pure number crunching: for every row, TreeExplainer walks each tree in the model to allocate credit across features. There is no I/O to wait on, so a single worker simply pegs one core. Python threads offer little relief here, since CPython's GIL serializes Python-level work and the workload saturates the CPU rather than blocking. Separate processes, each with its own interpreter, are what let the computation spread across all cores.

Parallelizing SHAP with Chunking

The key idea was simple: break the dataset into manageable chunks, process each chunk in parallel, then recombine the results.

import concurrent.futures
import pandas as pd
import shap

# Assume clf is a trained XGBoost model
explainer = shap.TreeExplainer(clf)
features = clf.get_booster().feature_names

# Split data into chunks
def split_dataframe(df, chunk_size):
    return [df.iloc[i:i+chunk_size] for i in range(0, len(df), chunk_size)]

# Function to compute SHAP for a chunk.
# Note: `explainer` and `features` are module-level globals, so worker
# processes inherit them (via fork on Linux) rather than rebuilding them.
def process_chunk(chunk):
    shap_values = explainer.shap_values(chunk[features])
    shap_df = pd.DataFrame(shap_values, columns=features, index=chunk.index)
    # Top 3 features by (signed) SHAP value for each row
    top_predictors = shap_df.apply(lambda x: x.nlargest(3).index.tolist(), axis=1)
    return top_predictors

# Parallel execution
def compute_shap_parallel(X, chunk_size=10000, num_workers=8):
    chunks = split_dataframe(X, chunk_size)
    with concurrent.futures.ProcessPoolExecutor(max_workers=num_workers) as executor:
        results = list(executor.map(process_chunk, chunks))
    return pd.concat(results)
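To see the chunk-and-recombine pattern end to end without a trained model, here is a self-contained sketch. The XGBoost explainer is replaced by a hypothetical stand-in that emits random attributions, and the chunks are mapped sequentially so the example runs anywhere; swapping the plain `map` for `executor.map` gives the parallel version above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for explainer.shap_values(): random "attributions"
# with the same shape as the input chunk. Feature names are made up.
rng = np.random.default_rng(0)
features = ["tenure", "spend", "visits"]

def fake_shap_values(chunk):
    return rng.normal(size=(len(chunk), len(features)))

def split_dataframe(df, chunk_size):
    return [df.iloc[i:i + chunk_size] for i in range(0, len(df), chunk_size)]

def process_chunk(chunk):
    shap_df = pd.DataFrame(fake_shap_values(chunk), columns=features, index=chunk.index)
    # Top 3 features by attribution for each row
    return shap_df.apply(lambda x: x.nlargest(3).index.tolist(), axis=1)

X = pd.DataFrame(rng.normal(size=(25, 3)), columns=features)
chunks = split_dataframe(X, chunk_size=10)      # 10 + 10 + 5 rows
result = pd.concat(map(process_chunk, chunks))  # sequential stand-in for executor.map

assert len(result) == len(X)
assert result.index.equals(X.index)  # recombined output lines up with the input rows
```

The key property to verify after recombination is that the concatenated output preserves the original row index, so SHAP results can be joined back onto the source data.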

Why Chunk Size Matters

Choosing the right chunk size is critical:

  - Too small, and the overhead of pickling chunks and shuttling results between processes starts to dominate the runtime.
  - Too large, and each worker holds a big slab of data plus its SHAP matrix in memory, while a straggler chunk can leave other cores idle.

The sweet spot balances CPU utilization with memory efficiency. In practice, experimenting with different chunk sizes on the target infrastructure is the most reliable way to find it.
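Before experimenting, a rough back-of-envelope check can narrow the search: estimate how much memory a chunk, plus its SHAP matrix of the same shape, will occupy per worker. The row, feature, and worker counts below are illustrative assumptions, not measured values.

```python
def chunk_memory_mb(n_rows, n_features, bytes_per_value=8):
    """Rough per-worker footprint: the input chunk plus its SHAP matrix,
    each n_rows x n_features of float64 values."""
    return 2 * n_rows * n_features * bytes_per_value / 1e6

# Example: 10,000-row chunks with 200 features, 8 workers in flight
per_chunk = chunk_memory_mb(10_000, 200)
print(f"~{per_chunk:.0f} MB per chunk, ~{8 * per_chunk:.0f} MB across 8 workers")
```

If the total across all in-flight workers approaches available RAM, shrink the chunk size before tuning anything else.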

Lessons Learned

  1. Don’t rebuild explainers unnecessarily. Construct your SHAP explainer once and reuse it across chunks.
  2. Use processes, not threads. SHAP + XGBoost is CPU-bound, so ProcessPoolExecutor is the right tool.
  3. Tune chunk size. It can dramatically impact runtime.
  4. Interpret carefully. SHAP explains model attributions, not causal reality.

Closing Thoughts

Model interpretability often comes at a computational cost. By applying parallelization with careful chunking, SHAP explanations that once took hours can be produced in minutes. More importantly, SHAP helps bridge the gap between prediction and insight—turning a machine learning model into both a forecasting tool and a guide for strategic decision-making.