Partial dependence plots

The standard interpretability patch for tree ensembles. Boosted forests / XGBoost models are individually opaque (“we have a model that we can’t understand” - L21-unsupervised-1); PDPs claw qualitative interpretability back by collapsing the model onto one predictor at a time, marginalizing out the rest. The book’s empirical estimator is just “average the prediction over all observations with $x_j$ held fixed.” Fair game on the exam to ask what they show.

Definition (prof’s framing)

“The partial dependence function represents the effects of $X_j$ on $f(X)$ after accounting for the effects of all the other variables. And that’s of course different than… what’s the effect of $X_j$ if everything else was not there. That’s a totally different question.” - L21-unsupervised-1

The mathematical ideal:

$$f_j(x_j) = \mathbb{E}_{x_{-j}}\!\left[ f(x_j,\, x_{-j}) \right]$$

marginalize over the joint distribution of the other predictors $x_{-j}$ at each value of $x_j$.

The book’s empirical estimator (the one you actually compute and the one you’d write on the exam):

$$\hat f_j(x_j) = \frac{1}{n} \sum_{i=1}^{n} \hat f(x_j,\, x_{i,-j})$$

For a chosen sweep of $x_j$ values, evaluate the trained model with the $j$-th coordinate set to $x_j$ but the other coordinates set to each observation’s own values, then average.

Notation & setup

  • $x_j$: the one predictor you’re interested in.
  • $x_{-j}$: all other predictors, treated as a vector.
  • $\hat f$: the trained black-box model (boosted ensemble, RF, etc.).
  • $\hat f_j(x_j) = \frac{1}{n} \sum_{i=1}^{n} \hat f(x_j,\, x_{i,-j})$: the empirical partial dependence at $x_j$.

How you’d compute one (pseudocode-ish, exam-friendly)

For a chosen variable $x_j$:

  1. Pick a grid of values $v_1, \dots, v_m$ over the range of $x_j$ (e.g. quantiles of the training data).
  2. For each grid value $v$:
    • For each training observation $x_i$, replace the $j$-th coordinate with $v$ but keep $x_{i,-j}$ as is. Call the modified row $\tilde x_i$.
    • Compute $\hat f(\tilde x_i)$.
    • Average: $\hat f_j(v) = \frac{1}{n} \sum_{i=1}^{n} \hat f(\tilde x_i)$.
  3. Plot $\hat f_j(v)$ versus $v$.
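The loop above can be sketched in plain Python; everything here (the `predict` lambda, the toy rows, the grid) is an illustrative assumption, not the lecture’s R/gbm code:

```python
def partial_dependence(predict, rows, j, grid):
    """Empirical PDP: for each grid value v, pin column j of every row at v,
    predict, and average -- the book's estimator (1/n) * sum_i f(v, x_{i,-j})."""
    curve = []
    for v in grid:
        preds = []
        for row in rows:
            modified = list(row)
            modified[j] = v                  # pin x_j at the grid value
            preds.append(predict(modified))  # other coordinates stay as observed
        curve.append(sum(preds) / len(preds))
    return curve

# Toy black box whose prediction uses both coordinates.
predict = lambda x: 2.0 * x[0] + 0.5 * x[1]

rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
grid = [0.0, 1.0, 2.0]

# For this linear model the PDP in x_0 is 2*v + 0.5*mean(x_1) = 2*v + 10.
print(partial_dependence(predict, rows, j=0, grid=grid))  # [10.0, 12.0, 14.0]
```

For a linear toy model the PDP is just a straight line; the estimator only shows interesting structure (kinks, plateaus) for genuinely nonlinear fits like boosted trees.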

Insights & mental models

  • Marginal effect, not causal effect. PDPs show how the prediction changes with $x_j$ “after accounting for” the other variables, averaging out their joint distribution. They are not “what would happen if $x_j$ changed and nothing else did” (that’s a counterfactual / causal question).
  • Why use the data and not the population.

    “You don’t have the underlying distribution, so you use the data to approximate that distribution.” - L21-unsupervised-1 Same logic as the bootstrap: the empirical distribution stands in for the unknown joint distribution of $x_{-j}$.

  • One-variable summaries, not full understanding. PDPs give you a one-variable curve. They don’t tell you about interactions (for that, you’d need 2-D PDPs or ICE / Shapley methods, all out of scope here).
  • Fragility under correlation. Averaging over the data with $x_j$ pinned can create implausible synthetic observations:

    “You might be looking at a tiny, tiny house with like 50 bedrooms, which doesn’t make any sense.” - L21-unsupervised-1 When predictors are highly correlated, the PDP averages over combinations that essentially never occur in reality.

  • Stability check. PDPs are computed from the data, so they wobble across resamples / bootstraps. If the overall shape is stable across reruns, that supports believing it; a noisy PDP suggests the marginal effect is weak or the model is unstable along that dimension.
  • Pairs naturally with variable-importance. Importance plots tell you which variables matter; PDPs tell you how the top variables matter. The two together give you a usable picture of an opaque ensemble.
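The stability check can be sketched by recomputing the PDP on bootstrap resamples of the rows and looking at the spread of the curves; the model and data below are toy assumptions, not the lecture’s Boston fit:

```python
import random

def pdp(predict, rows, j, grid):
    # Empirical PDP: average predictions with coordinate j pinned at each grid value.
    return [sum(predict(row[:j] + [v] + row[j + 1:]) for row in rows) / len(rows)
            for v in grid]

predict = lambda x: x[0] ** 2 + x[1]                  # toy black box
rows = [[float(i), float(i % 5)] for i in range(50)]  # toy training data
grid = [0.0, 1.0, 2.0, 3.0]

random.seed(0)
curves = []
for _ in range(20):
    resample = random.choices(rows, k=len(rows))      # bootstrap sample of rows
    curves.append(pdp(predict, resample, j=0, grid=grid))

# Spread of the curve across resamples at each grid point:
# small spread -> the PDP's shape is believable; large -> weak or unstable effect.
spread = [max(c[k] for c in curves) - min(c[k] for c in curves)
          for k in range(len(grid))]
print(spread)
```

A plot of all 20 curves overlaid would be the visual version of this check; the printed per-grid-point ranges are the numeric shortcut.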

Worked example, Boston housing (slides + lecture)

After fitting boost.boston <- gbm(medv~., ..., n.trees=5000, interaction.depth=4), the slide deck shows PDPs for rm (number of rooms, the most important predictor) and lstat (% lower-status population). Both are in the slide refs:

plot(boost.boston, i = "rm",    ylab = "medv")
plot(boost.boston, i = "lstat", ylab = "medv")

PDP for rm is monotone increasing (more rooms → higher predicted price). PDP for lstat is monotone decreasing (higher %low-SES → lower predicted price). Magnitude and shape (kinks, plateaus) are the qualitative info you’d quote on the exam.

Exam signals

“Partial dependence plots… know what they mean and where they come from, but you won’t compute them.” - L27-summary (paraphrased, see the importance plot callout for the full Q6c context: “[Variable] importance plots are fair game, know what they mean and where they come from, but you won’t compute them.”)

The same logic the prof applied to importance plots applies to PDPs: in scope (slides + lecture), explicitly fair game for “what does this show” interpretation, not for hand-computation.

“If it was covered either in the slides or in the exercises, then I would say fair game.” - L27-summary

Pitfalls

  • Reading PDPs as causal effects. They aren’t. They are averages of model predictions with $x_j$ held fixed, useful for understanding what the model learned, not for “what would happen if I changed this in the world.”
  • Forgetting they effectively assume independence between $x_j$ and $x_{-j}$. When predictors are heavily correlated, the PDP averages over implausible combinations and can be misleading.
  • Treating one PDP as a global model summary. A PDP shows one variable; the model’s behavior also depends on interactions and on combinations of other variables. PDPs are point summaries, not the whole picture.
  • Confusing the PDP with the conditional expectation of the prediction given $X_j = x_j$. The PDP averages $\hat f(x_j, x_{i,-j})$ over all training rows, regardless of each row’s actual $x_j$; the conditional expectation averages only over observations near $X_j = x_j$. Different objects.
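The correlation pitfall is easy to see by listing the synthetic rows the estimator actually feeds the model; the size/bedrooms numbers below are made-up toy data echoing the lecture’s “tiny house with 50 bedrooms” example:

```python
# Toy data: [size in m^2, number of bedrooms] -- strongly correlated columns.
rows = [[50.0, 1.0], [80.0, 2.0], [120.0, 3.0], [200.0, 5.0]]

def pdp_rows(rows, j, v):
    """The synthetic rows the empirical PDP evaluates at grid value v."""
    out = []
    for row in rows:
        modified = list(row)
        modified[j] = v   # pin coordinate j, keep the rest as observed
        out.append(modified)
    return out

# Pinning size at 50 m^2 forces the model to predict for a 50 m^2,
# 5-bedroom house -- a combination that never occurs in the data.
print(pdp_rows(rows, j=0, v=50.0))
```

Every one of these synthetic rows contributes equally to the PDP average, plausible or not; that is exactly why correlated predictors make the curve untrustworthy.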

Scope vs ISLP

  • In scope: the empirical estimator $\hat f_j(x_j) = \frac{1}{n} \sum_{i=1}^{n} \hat f(x_j,\, x_{i,-j})$; the marginalization-over-$x_{-j}$ idea; what a PDP shows (qualitative effect of $x_j$ on the prediction, after accounting for others); the contrast with “the effect of $X_j$ if everything else were not there” (causal); pairing with variable-importance for tree-ensemble interpretation.
  • Look up in ISLP: PDPs aren’t deeply covered in ISLP’s Chapter 8; the slide deck refers to Elements of Statistical Learning §10.13.2 for the full treatment. Anders does not need this depth; the empirical formula and the marginalization framing are enough.
  • Skip in ISLP (book-only / out of scope):
    • ICE plots (individual conditional expectation): not lectured.
    • Shapley / SHAP values - L21-unsupervised-1 mentions interpretability machinery in passing, never derives.
    • Two-variable / interaction PDPs: not lectured.

Exercise instances

None. No Exercise 9 problem touches PDPs explicitly; they’re pure lecture / slide content.

How it might appear on the exam

  • Output interpretation: given a PDP for one variable from a boosted-tree fit, describe the qualitative effect (e.g. “predicted house price increases monotonically with number of rooms; the effect plateaus past 7 rooms”).
  • Multiple choice / true-false: “a PDP shows the causal effect of $x_j$ on $Y$” (false, it’s a marginal effect on the model’s prediction), “a PDP averages the model’s prediction across the data with $x_j$ fixed” (true).
  • Conceptual short-answer: “you have a black-box gradient-boosted model and you want to understand how lstat affects predicted price. What do you compute?” Expected: the partial dependence of the predicted price on lstat: sweep lstat over a grid, hold the other predictors at the data values, average the predictions, plot.
  • Pseudocode: given the empirical PDP formula, write the algorithm in pseudocode (per the open-book / no-language rule, pseudocode is acceptable).
  • Pitfall awareness: “when can a PDP mislead you?” Expected: highly correlated predictors → averaging over implausible combinations.
Related concepts

  • boosting: the parent context; PDPs are one of the two interpretability tools the prof covers for boosted ensembles.
  • gradient-boosting / xgboost: the typical opaque ensembles PDPs are computed from.
  • random-forest: same interpretability gap, same PDP fix; PDPs work for any black-box $\hat f$.
  • variable-importance: the other interpretability tool from module 8 / 9; importance answers “which variables matter,” PDPs answer “how do the important ones matter.”
  • regression-tree / classification-tree: the components of the ensembles that PDPs interpret; single trees don’t need PDPs because they’re already interpretable.