Every “mathy derivation” problem found in the ISLP textbook end-of-chapter exercises, restricted to chapters that are in scope per docs/scope.md. Same filter as math-exercises.md: derivations, proofs, algebraic show-that questions, and hand-calculations whose algebraic structure is the point. Conceptual prose, R/Python coding, and pure plot-interpretation problems are excluded.

Source links point to the ISLP chapter markdown in wiki/book/. Borderline / scope-edge problems carry a brief Scope note.

ISLP chapters that map to in-scope modules:

| Module | ISLP ch | File |
| --- | --- | --- |
| 2 — Stat learning | 2 | 02-statlearn.md |
| 3 — Linear regression | 3 | 03-linreg.md |
| 4 — Classification | 4 | 04-classif.md |
| 5 — Resampling | 5 | 05-resample.md |
| 6 — Model selection | 6 | 06-modelsel.md |
| 7 — Beyond linear | 7 | 07-beyondlinear.md |
| 8 — Trees, 9 — Boosting | 8 | 08-trees.md |
| 10 — Unsupervised | 12 | 12-unsupervised.md |
| 11 — Neural networks | 10 | 10-deeplearning.md |

ISLP ch 9 (SVM), ch 11 (survival), and ch 13 (multiple testing) are excluded — prof took them out of scope.


Module 2 — Statistical learning

Chapter 2’s conceptual exercises (2.4.1–2.4.7) are all verbal/intuition prompts (KNN by hand, bias-variance reasoning) — no algebraic derivations. Nothing in scope from this chapter. The bias-variance derivation lives in the lecture notes and Compulsory 1, not ISLP exercises.


Module 3 — Linear regression

Exercise 3.5 — Fitted values as linear combinations of responses

Source: 03-linreg.md:1470-1490 (Conceptual 5)

Consider the fitted values that result from performing linear regression without an intercept. In this setting, the $i$th fitted value takes the form

$$\hat{y}_i = x_i \hat{\beta},$$

where

$$\hat{\beta} = \left( \sum_{i=1}^{n} x_i y_i \right) \Big/ \left( \sum_{i'=1}^{n} x_{i'}^2 \right).$$

Show that we can write

$$\hat{y}_i = \sum_{i'=1}^{n} a_{i'} y_{i'}.$$

What is $a_{i'}$?

Note: We interpret this result by saying that the fitted values from linear regression are linear combinations of the response values.
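
A quick numerical sanity check of the claim (a sketch with simulated data; the helper name `fitted` and the data are mine, not ISLP's):

```python
# Hedged sketch: checks empirically that no-intercept fitted values are a linear
# function of the response vector y, which is what Exercise 3.5 asks you to show.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=20)

def fitted(y, x):
    """Fitted values from regressing y onto x with no intercept."""
    beta_hat = np.sum(x * y) / np.sum(x ** 2)
    return x * beta_hat

y1, y2, c = rng.normal(size=20), rng.normal(size=20), 2.7
# Linearity in y: fitted(c*y1 + y2) == c*fitted(y1) + fitted(y2)
assert np.allclose(fitted(c * y1 + y2, x), c * fitted(y1, x) + fitted(y2, x))
```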

Exercise 3.6 — Regression line passes through the means

Source: 03-linreg.md:1492 (Conceptual 6)

Using (3.4), argue that in the case of simple linear regression, the least squares line always passes through the point $(\bar{x}, \bar{y})$.

Exercise 3.7 — R² equals squared correlation

Source: 03-linreg.md:1494 (Conceptual 7)

It is claimed in the text that in the case of simple linear regression of $Y$ onto $X$, the $R^2$ statistic (3.17) is equal to the square of the correlation between $X$ and $Y$ (3.18). Prove that this is the case. For simplicity, you may assume that $\bar{x} = \bar{y} = 0$.

Exercise 3.11 (d, e) — Algebraic form of the t-statistic; symmetry under x ↔ y swap

Source: 03-linreg.md:1542-1554 (Applied 11, parts d–e)

For the regression of $Y$ onto $X$ without an intercept, the $t$-statistic for $H_0 : \beta = 0$ takes the form $\hat{\beta}/\mathrm{SE}(\hat{\beta})$, where $\hat{\beta}$ is given by (3.38), and where

$$\mathrm{SE}(\hat{\beta}) = \sqrt{\frac{\sum_{i=1}^{n} (y_i - x_i \hat{\beta})^2}{(n-1) \sum_{i'=1}^{n} x_{i'}^2}}.$$

(These formulas are slightly different from those given in Sections 3.1.1 and 3.1.2, since here we are performing regression without an intercept.) Show algebraically, and confirm numerically in R, that the $t$-statistic can be written as

$$\frac{(\sqrt{n-1}) \sum_{i=1}^{n} x_i y_i}{\sqrt{\left(\sum_{i=1}^{n} x_i^2\right) \left(\sum_{i'=1}^{n} y_{i'}^2\right) - \left(\sum_{i'=1}^{n} x_{i'} y_{i'}\right)^2}}.$$

(e) Using the results from (d), argue that the $t$-statistic for the regression of $y$ onto $x$ is the same as the $t$-statistic for the regression of $x$ onto $y$.
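
A numeric confirmation of part (d), sketched in Python rather than R (simulated data, my own helper names):

```python
# Hedged sketch: compares the no-intercept t-statistic computed as beta_hat/SE
# against the closed-form expression in part (d), and checks the x/y symmetry of (e).
import numpy as np

rng = np.random.default_rng(1)
n = 100
x = rng.normal(size=n)
y = 2 * x + rng.normal(size=n)

def t_stat(x, y):
    beta_hat = np.sum(x * y) / np.sum(x ** 2)
    se = np.sqrt(np.sum((y - x * beta_hat) ** 2) / ((n - 1) * np.sum(x ** 2)))
    return beta_hat / se

closed_form = np.sqrt(n - 1) * np.sum(x * y) / np.sqrt(
    np.sum(x ** 2) * np.sum(y ** 2) - np.sum(x * y) ** 2
)
assert np.isclose(t_stat(x, y), closed_form)   # part (d)
assert np.isclose(t_stat(x, y), t_stat(y, x))  # part (e)
```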


Module 4 — Classification

Exercise 4.8.1 — Logistic ⇔ logit equivalence

Source: 04-classif.md:1861 (Conceptual 1)

Using a little bit of algebra, prove that (4.2) is equivalent to (4.3). In other words, the logistic function representation and logit representation for the logistic regression model are equivalent.
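
For reference, the two representations being equated are (as I recall them; check (4.2)/(4.3) in the chapter for the exact forms):

$$p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}} \qquad\Longleftrightarrow\qquad \frac{p(X)}{1 - p(X)} = e^{\beta_0 + \beta_1 X}.$$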

Exercise 4.8.2 — Bayes classifier reduces to LDA discriminant

Source: 04-classif.md:1863 (Conceptual 2)

It was stated in the text that classifying an observation to the class for which (4.17) is largest is equivalent to classifying an observation to the class for which (4.18) is largest. Prove that this is the case. In other words, under the assumption that the observations in the $k$th class are drawn from a $N(\mu_k, \sigma^2)$ distribution, the Bayes classifier assigns an observation to the class for which the discriminant function is maximized.

Exercise 4.8.3 — Bayes classifier is quadratic when σ²_k differs

Source: 04-classif.md:1865-1869 (Conceptual 3)

This problem relates to the QDA model, in which the observations within each class are drawn from a normal distribution with a class-specific mean vector and a class-specific covariance matrix. We consider the simple case where $p = 1$; i.e. there is only one feature.

Suppose that we have $K$ classes, and that if an observation belongs to the $k$th class then $X$ comes from a one-dimensional normal distribution, $X \sim N(\mu_k, \sigma_k^2)$. Recall that the density function for the one-dimensional normal distribution is given in (4.16). Prove that in this case, the Bayes classifier is not linear. Argue that it is in fact quadratic.

Hint: For this problem, you should follow the arguments laid out in Section 4.4.1, but without making the assumption that $\sigma_1^2 = \cdots = \sigma_K^2$.

Exercise 4.8.7 — Predicted probability from Bayes theorem with Gaussian classes

Source: 04-classif.md:1901-1903 (Conceptual 7)

Suppose that we wish to predict whether a given stock will issue a dividend this year (“Yes” or “No”) based on $X$, last year’s percent profit. We examine a large number of companies and discover that the mean value of $X$ for companies that issued a dividend was $\bar{X} = 10$, while the mean for those that didn’t was $\bar{X} = 0$. In addition, the variance of $X$ for these two sets of companies was $\hat{\sigma}^2 = 36$. Finally, 80% of companies issued dividends. Assuming that $X$ follows a normal distribution, predict the probability that a company will issue a dividend this year given that its percentage profit was $X = 4$ last year.

Hint: Recall that the density function for a normal random variable is $f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-(x-\mu)^2/(2\sigma^2)}$. You will need to use Bayes’ theorem.
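
A quick check of the hand calculation, sketched in Python (it assumes the parameter values as transcribed above, so verify them against your copy of ISLP):

```python
# Hedged sketch: plug-in of Bayes' theorem with Gaussian class densities,
# using the values transcribed above (prior 0.8, means 10 and 0, variance 36).
from math import exp, pi, sqrt

def normal_pdf(x, mu, var):
    return exp(-(x - mu) ** 2 / (2 * var)) / sqrt(2 * pi * var)

prior_yes, prior_no = 0.8, 0.2
x = 4
numerator = prior_yes * normal_pdf(x, mu=10, var=36)
posterior_yes = numerator / (numerator + prior_no * normal_pdf(x, mu=0, var=36))
print(posterior_yes)  # P(dividend | X = 4)
```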

Exercise 4.8.9 — Odds ↔ probability conversion

Source: 04-classif.md:1907-1911 (Conceptual 9)

This problem has to do with odds.

(a) On average, what fraction of people with an odds of 0.37 of defaulting on their credit card payment will in fact default?

(b) Suppose that an individual has a 16% chance of defaulting on her credit card payment. What are the odds that she will default?
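
The only machinery needed here is the odds/probability conversion:

$$\text{odds} = \frac{p}{1 - p}, \qquad p = \frac{\text{odds}}{1 + \text{odds}}.$$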

Exercise 4.8.10 — Univariate LDA log-odds expressions (a_k, b_kj)

Source: 04-classif.md:1913 (Conceptual 10)

Equation 4.32 derived an expression for $\log\left(\frac{\Pr(Y=k \mid X=x)}{\Pr(Y=K \mid X=x)}\right)$ in the setting where $p > 1$, so that the mean for the $k$th class, $\mu_k$, is a $p$-dimensional vector, and the shared covariance $\Sigma$ is a $p \times p$ matrix. However, in the setting with $p = 1$, (4.32) takes a simpler form, since the means $\mu_1, \ldots, \mu_K$ and the variance $\sigma^2$ are scalars. In this simpler setting, repeat the calculation in (4.32), and provide expressions for $a_k$ and $b_{kj}$ in terms of $\pi_k$, $\pi_K$, $\mu_k$, $\mu_K$, and $\sigma^2$.

Scope note: invokes multi-class LDA framing; the multi-class generalization of logistic regression is OUT of scope (prof Jan 27), but LDA discriminant derivation itself is IN.

Exercise 4.8.11 — Detailed QDA discriminant coefficients

Source: 04-classif.md:1915 (Conceptual 11)

Work out the detailed forms of $a_k$, $b_{kj}$, and $c_{kjl}$ in (4.33). Your answer should involve $\pi_k$, $\pi_K$, $\mu_k$, $\mu_K$, $\Sigma_k$, and $\Sigma_K$.

Exercise 4.8.12 — Binary logistic ↔ softmax parameterization

Source: 04-classif.md:1917-1937 (Conceptual 12)

Suppose that you wish to classify an observation $X \in \mathbb{R}$ into apples and oranges. You fit a logistic regression model and find that

$$\widehat{\Pr}(Y = \text{orange} \mid X = x) = \frac{\exp(\hat{\beta}_0 + \hat{\beta}_1 x)}{1 + \exp(\hat{\beta}_0 + \hat{\beta}_1 x)}.$$

Your friend fits a logistic regression model to the same data using the softmax formulation in (4.13), and finds that

$$\widehat{\Pr}(Y = \text{orange} \mid X = x) = \frac{\exp(\hat{\alpha}_{\text{orange}0} + \hat{\alpha}_{\text{orange}1} x)}{\exp(\hat{\alpha}_{\text{orange}0} + \hat{\alpha}_{\text{orange}1} x) + \exp(\hat{\alpha}_{\text{apple}0} + \hat{\alpha}_{\text{apple}1} x)}.$$

(a) What is the log odds of orange versus apple in your model?

(b) What is the log odds of orange versus apple in your friend’s model?

(c) Suppose that in your model, $\hat{\beta}_0 = 2$ and $\hat{\beta}_1 = -1$. What are the coefficient estimates in your friend’s model? Be as specific as possible.

(d) Now suppose that you and your friend fit the same two models on a different data set. This time, your friend gets the coefficient estimates $\hat{\alpha}_{\text{orange}0} = 1.2$, $\hat{\alpha}_{\text{orange}1} = -2$, $\hat{\alpha}_{\text{apple}0} = 3$, $\hat{\alpha}_{\text{apple}1} = 0.6$. What are the coefficient estimates in your model?

(e) Finally, suppose you apply both models from (d) to a data set with 2,000 test observations. What fraction of the time do you expect the predicted class labels from your model to agree with those from your friend’s model? Explain your answer.

Scope note: softmax over-parameterization. Useful framing for NN output layer, but the multi-class logistic specifics are OUT of scope per prof. Treat as illustrative for the binary ↔ softmax algebra.


Module 5 — Resampling

Exercise 5.4.1 — Derive minimum-variance portfolio weight

Source: 05-resample.md:633 (Conceptual 1)

Using basic statistical properties of the variance, as well as single-variable calculus, derive (5.6). In other words, prove that $\alpha$ given by (5.6) does indeed minimize $\mathrm{Var}(\alpha X + (1 - \alpha) Y)$.
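
For orientation, expanding the objective with standard variance rules gives

$$\mathrm{Var}(\alpha X + (1 - \alpha) Y) = \alpha^2 \sigma_X^2 + (1 - \alpha)^2 \sigma_Y^2 + 2 \alpha (1 - \alpha) \sigma_{XY},$$

and (5.6) is, as I recall, $\alpha = \dfrac{\sigma_Y^2 - \sigma_{XY}}{\sigma_X^2 + \sigma_Y^2 - 2\sigma_{XY}}$; setting the derivative of the expansion to zero recovers it.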

Exercise 5.4.2 — Bootstrap inclusion probability

Source: 05-resample.md:635-653 (Conceptual 2)

We will now derive the probability that a given observation is part of a bootstrap sample. Suppose that we obtain a bootstrap sample from a set of $n$ observations.

(a) What is the probability that the first bootstrap observation is not the $j$th observation from the original sample? Justify your answer.

(b) What is the probability that the second bootstrap observation is not the $j$th observation from the original sample?

(c) Argue that the probability that the $j$th observation is not in the bootstrap sample is $(1 - 1/n)^n$.

(d) When $n = 5$, what is the probability that the $j$th observation is in the bootstrap sample?

(e) When $n = 100$, what is the probability that the $j$th observation is in the bootstrap sample?

(f) When $n = 10{,}000$, what is the probability that the $j$th observation is in the bootstrap sample?

(g) Create a plot that displays, for each integer value of $n$ from 1 to 100,000, the probability that the $j$th observation is in the bootstrap sample. Comment on what you observe.
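
Part (g) is a one-line plot; a minimal sketch (matplotlib, my own variable names):

```python
# Hedged sketch for part (g): P(observation j is in a bootstrap sample of size n)
# is 1 - (1 - 1/n)^n; plot it for n = 1, ..., 100,000.
import numpy as np
import matplotlib.pyplot as plt

n = np.arange(1, 100001)
prob_in_sample = 1 - (1 - 1 / n) ** n
plt.plot(n, prob_in_sample)
plt.xscale("log")
plt.xlabel("n")
plt.ylabel("P(jth observation in bootstrap sample)")
plt.show()  # flattens quickly toward 1 - 1/e ≈ 0.632
```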


Module 6 — Model selection / regularization

Exercise 6.6.5 — Ridge vs lasso geometry under correlated predictors

Source: 06-modelsel.md:1416-1422 (Conceptual 5)

It is well-known that ridge regression tends to give similar coefficient values to correlated variables, whereas the lasso may give quite different coefficient values to correlated variables. We will now explore this property in a very simple setting.

Suppose that $n = 2$, $p = 2$, $x_{11} = x_{12}$, $x_{21} = x_{22}$. Furthermore, suppose that $y_1 + y_2 = 0$ and $x_{11} + x_{21} = 0$ and $x_{12} + x_{22} = 0$, so that the estimate for the intercept in a least squares, ridge regression, or lasso model is zero: $\hat{\beta}_0 = 0$.

(a) Write out the ridge regression optimization problem in this setting.

(b) Argue that in this setting, the ridge coefficient estimates satisfy $\hat{\beta}_1 = \hat{\beta}_2$.

(c) Write out the lasso optimization problem in this setting.

(d) Argue that in this setting, the lasso coefficients $\hat{\beta}_1$ and $\hat{\beta}_2$ are not unique—in other words, there are many possible solutions to the optimization problem in (c). Describe these solutions.
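
For reference, the general ridge and lasso criteria that parts (a) and (c) ask you to specialize are (up to the book’s exact notation):

$$\sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \ \ \text{(ridge)}, \qquad \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \ \ \text{(lasso)}.$$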

ISLP 6.6.7 (Bayesian connection to ridge/lasso via Gaussian / Laplace priors) is omitted — prof Mar 2: “I really don’t think I’d put this on the test… kind of assumes a lot of knowledge that maybe you don’t have.”


Module 7 — Beyond linearity (splines / GAMs)

Exercise 7.9.1 — Cubic spline basis: continuity of f, f′, f″ at the knot

Source: 07-beyondlinear.md:1124-1156 (Conceptual 1)

It was mentioned in this chapter that a cubic regression spline with one knot at $\xi$ can be obtained using a basis of the form $x$, $x^2$, $x^3$, $(x - \xi)_+^3$, where $(x - \xi)_+^3 = (x - \xi)^3$ if $x > \xi$ and equals 0 otherwise. We will now show that a function of the form

$$f(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4 (x - \xi)_+^3$$

is indeed a cubic regression spline, regardless of the values of $\beta_0, \beta_1, \beta_2, \beta_3, \beta_4$.

(a) Find a cubic polynomial

$$f_1(x) = a_1 + b_1 x + c_1 x^2 + d_1 x^3$$

such that $f(x) = f_1(x)$ for all $x \le \xi$. Express $a_1, b_1, c_1, d_1$ in terms of $\beta_0, \beta_1, \beta_2, \beta_3, \beta_4$.

(b) Find a cubic polynomial

$$f_2(x) = a_2 + b_2 x + c_2 x^2 + d_2 x^3$$

such that $f(x) = f_2(x)$ for all $x > \xi$. Express $a_2, b_2, c_2, d_2$ in terms of $\beta_0, \beta_1, \beta_2, \beta_3, \beta_4$. We have now established that $f(x)$ is a piecewise polynomial.

(c) Show that $f_1(\xi) = f_2(\xi)$. That is, $f(x)$ is continuous at $\xi$.

(d) Show that $f_1'(\xi) = f_2'(\xi)$. That is, $f'(x)$ is continuous at $\xi$.

(e) Show that $f_1''(\xi) = f_2''(\xi)$. That is, $f''(x)$ is continuous at $\xi$.

Therefore, $f(x)$ is indeed a cubic spline.

Scope note: prof said Mar 9 he won’t derive natural-spline basis math; this is the truncated-power cubic-spline basis, which is foundational. Borderline.
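
A numerical check of parts (c)–(e), sketched with arbitrary coefficients (my own code, not from ISLP): one-sided finite differences at the knot show that $f$, $f'$, and $f''$ agree from the left and right, while $f'''$ jumps.

```python
# Hedged sketch: one-sided finite-difference estimates of f, f', f'', f''' at the
# knot xi, from the left and from the right, for arbitrary beta values.
# (Orders 0-2 should agree up to O(h); order 3 should jump by roughly 6*b4.)
import numpy as np

b0, b1, b2, b3, b4, xi = 0.5, -1.2, 0.7, 2.0, -3.1, 1.0
h = 1e-4

def f(x):
    return b0 + b1 * x + b2 * x**2 + b3 * x**3 + b4 * max(x - xi, 0.0)**3

def one_sided(order, side):
    """order-th derivative estimate at xi using points on one side only (side = +1 or -1)."""
    pts = [f(xi + side * k * h) for k in range(order + 1)]
    return np.diff(pts, n=order)[0] / (side * h) ** order

for order in range(4):
    print(order, one_sided(order, -1), one_sided(order, +1))
```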

Exercise 7.9.5 — Comparing smoothing-spline penalty orders

Source: 07-beyondlinear.md:1182-1192 (Conceptual 5)

Consider two curves, $\hat{g}_1$ and $\hat{g}_2$, defined by

$$\hat{g}_1 = \arg\min_g \left( \sum_{i=1}^{n} (y_i - g(x_i))^2 + \lambda \int \left[ g^{(3)}(x) \right]^2 dx \right), \qquad \hat{g}_2 = \arg\min_g \left( \sum_{i=1}^{n} (y_i - g(x_i))^2 + \lambda \int \left[ g^{(4)}(x) \right]^2 dx \right),$$

where $g^{(m)}$ represents the $m$th derivative of $g$.

(a) As $\lambda \to \infty$, will $\hat{g}_1$ or $\hat{g}_2$ have the smaller training RSS?

(b) As $\lambda \to \infty$, will $\hat{g}_1$ or $\hat{g}_2$ have the smaller test RSS?

(c) For $\lambda = 0$, will $\hat{g}_1$ or $\hat{g}_2$ have the smaller training and test RSS?


Modules 8 & 9 — Trees & boosting

Exercise 8.4.2 — Why boosting stumps gives an additive model

Source: 08-trees.md:895-901 (Conceptual 2)

It is mentioned in Section 8.2.3 that boosting using depth-one trees (or stumps) leads to an additive model: that is, a model of the form

$$f(X) = \sum_{j=1}^{p} f_j(X_j).$$

Explain why this is the case. You can begin with (8.12) in Algorithm 8.2.

Exercise 8.4.3 — Gini, classification error, entropy as functions of p̂

Source: 08-trees.md:903-905 (Conceptual 3)

Consider the Gini index, classification error, and entropy in a simple classification setting with two classes. Create a single plot that displays each of these quantities as a function of $\hat{p}_{m1}$. The $x$-axis should display $\hat{p}_{m1}$, ranging from 0 to 1, and the $y$-axis should display the value of the Gini index, classification error, and entropy.

Hint: In a setting with two classes, $\hat{p}_{m1} = 1 - \hat{p}_{m2}$. You could make this plot by hand, but it will be much easier to make in R.

Scope note: the by-hand version (write each formula in p̂, evaluate at a few points) is the exam-shaped version; the R-plot framing is for the lab.
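
A minimal sketch of the plot (my own code; the exercise’s R framing translated to Python/matplotlib):

```python
# Hedged sketch: Gini index, classification error, and entropy for a two-class
# node, plotted as functions of p_hat = p_m1 (so p_m2 = 1 - p_hat).
import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0.001, 0.999, 500)
gini = 2 * p * (1 - p)
class_err = 1 - np.maximum(p, 1 - p)
entropy = -(p * np.log(p) + (1 - p) * np.log(1 - p))

for vals, label in [(gini, "Gini index"), (class_err, "Classification error"), (entropy, "Entropy")]:
    plt.plot(p, vals, label=label)
plt.xlabel(r"$\hat{p}_{m1}$")
plt.ylabel("Value")
plt.legend()
plt.show()
```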

Exercise 8.4.4 — Tree ↔ partition reconstruction

Source: 08-trees.md:907-912 (Conceptual 4)

This question relates to the plots in Figure 8.14.

Figure 8.14. Left: A partition of the predictor space corresponding to Exercise 4a. Right: A tree corresponding to Exercise 4b.

(a) Sketch the tree corresponding to the partition of the predictor space illustrated in the left-hand panel of Figure 8.14. The numbers inside the boxes indicate the mean of $Y$ within each region.

(b) Create a diagram similar to the left-hand panel of Figure 8.14, using the tree illustrated in the right-hand panel of the same figure. You should divide up the predictor space into the correct regions, and indicate the mean for each region.

Exercise 8.4.5 — Bagging: majority vote vs average probability

Source: 08-trees.md:914-920 (Conceptual 5)

Suppose we produce ten bootstrapped samples from a data set containing red and green classes. We then apply a classification tree to each bootstrapped sample and, for a specific value of $X$, produce 10 estimates of $\Pr(\text{Class is Red} \mid X)$:

$$0.1,\ 0.15,\ 0.2,\ 0.2,\ 0.55,\ 0.6,\ 0.6,\ 0.65,\ 0.7,\ \text{and } 0.75.$$
There are two common ways to combine these results together into a single class prediction. One is the majority vote approach discussed in this chapter. The second approach is to classify based on the average probability. In this example, what is the final classification under each of these two approaches?

Exercise 8.4.6 — Regression-tree fitting algorithm

Source: 08-trees.md:922 (Conceptual 6)

Provide a detailed explanation of the algorithm that is used to fit a regression tree.


Module 10 — Unsupervised (ISLP ch 12)

Exercise 12.6.1 — Prove K-means monotone decrease

Source: 12-unsupervised.md:1281-1283 (Conceptual 1)

This problem involves the $K$-means clustering algorithm.

(a) Prove (12.18).

(b) On the basis of this identity, argue that the $K$-means clustering algorithm (Algorithm 12.2) decreases the objective (12.17) at each iteration.
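
For reference, the identity in part (a) is, as I recall it (check (12.18) for the exact statement):

$$\frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2 = 2 \sum_{i \in C_k} \sum_{j=1}^{p} (x_{ij} - \bar{x}_{kj})^2,$$

where $\bar{x}_{kj}$ is the mean of feature $j$ over the observations in cluster $C_k$.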

Exercise 12.6.2 — Hand-build dendrograms (complete & single linkage)

Source: 12-unsupervised.md:1285-1299 (Conceptual 2)

Suppose that we have four observations, for which we compute a dissimilarity matrix, given by

$$\begin{bmatrix} & 0.3 & 0.4 & 0.7 \\ 0.3 & & 0.5 & 0.8 \\ 0.4 & 0.5 & & 0.45 \\ 0.7 & 0.8 & 0.45 & \end{bmatrix}.$$

For instance, the dissimilarity between the first and second observations is 0.3, and the dissimilarity between the second and fourth observations is 0.8.

(a) On the basis of this dissimilarity matrix, sketch the dendrogram that results from hierarchically clustering these four observations using complete linkage. Be sure to indicate on the plot the height at which each fusion occurs, as well as the observations corresponding to each leaf in the dendrogram.

(b) Repeat (a), this time using single linkage clustering.

(c) Suppose that we cut the dendrogram obtained in (a) such that two clusters result. Which observations are in each cluster?

(d) Suppose that we cut the dendrogram obtained in (b) such that two clusters result. Which observations are in each cluster?

(e) It is mentioned in this chapter that at each fusion in the dendrogram, the position of the two clusters being fused can be swapped without changing the meaning of the dendrogram. Draw a dendrogram that is equivalent to the dendrogram in (a), for which two or more of the leaves are repositioned, but for which the meaning of the dendrogram is the same.
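
A scipy sketch of parts (a) and (b) (my own code; the dissimilarity matrix is the one transcribed above, so double-check it against the book before trusting the output):

```python
# Hedged sketch: complete- and single-linkage dendrograms for the 4x4
# dissimilarity matrix transcribed above.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform

D = np.array([[0.00, 0.30, 0.40, 0.70],
              [0.30, 0.00, 0.50, 0.80],
              [0.40, 0.50, 0.00, 0.45],
              [0.70, 0.80, 0.45, 0.00]])

fig, axes = plt.subplots(1, 2, figsize=(8, 4))
for ax, method in zip(axes, ["complete", "single"]):
    Z = linkage(squareform(D), method=method)  # condensed distances -> linkage matrix
    dendrogram(Z, labels=[1, 2, 3, 4], ax=ax)
    ax.set_title(f"{method} linkage")
plt.show()
```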

Exercise 12.6.3 — Hand-execute K-means with K=2, n=6, p=2

Source: 12-unsupervised.md:1301-1317 (Conceptual 3)

In this problem, you will perform $K$-means clustering manually, with $K = 2$, on a small example with $n = 6$ observations and $p = 2$ features. The observations are as follows.

| Obs. | $X_1$ | $X_2$ |
| --- | --- | --- |
| 1 | 1 | 4 |
| 2 | 1 | 3 |
| 3 | 0 | 4 |
| 4 | 5 | 1 |
| 5 | 6 | 2 |
| 6 | 4 | 0 |

(a) Plot the observations.

(b) Randomly assign a cluster label to each observation. You can use the np.random.choice() function to do this. Report the cluster labels for each observation.

(c) Compute the centroid for each cluster.

(d) Assign each observation to the centroid to which it is closest, in terms of Euclidean distance. Report the cluster labels for each observation.

(e) Repeat (c) and (d) until the answers obtained stop changing.

(f) In your plot from (a), color the observations according to the cluster labels obtained.
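
A small Python sketch of steps (b)–(e) (my own code; observations as transcribed above, and it assumes neither cluster empties out during the loop):

```python
# Hedged sketch: iterate the centroid/assignment steps of manual K-means (K = 2).
import numpy as np

X = np.array([[1, 4], [1, 3], [0, 4], [5, 1], [6, 2], [4, 0]], dtype=float)
rng = np.random.default_rng(0)

labels = rng.choice(2, size=len(X))           # step (b): random initial labels
while len(set(labels)) < 2:                   # re-draw a degenerate start
    labels = rng.choice(2, size=len(X))

while True:
    centroids = np.array([X[labels == k].mean(axis=0) for k in (0, 1)])   # step (c)
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)   # squared Euclidean
    new_labels = dists.argmin(axis=1)                                     # step (d)
    if np.array_equal(new_labels, labels):                                # step (e)
        break
    labels = new_labels

print(labels, centroids)
```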

Exercise 12.6.4 — Single vs complete linkage fusion heights

Source: 12-unsupervised.md:1319-1321 (Conceptual 4)

Suppose that for a particular data set, we perform hierarchical clustering using single linkage and using complete linkage. We obtain two dendrograms.

(a) At a certain point on the single linkage dendrogram, the clusters $\{1, 2, 3\}$ and $\{4, 5\}$ fuse. On the complete linkage dendrogram, the clusters $\{1, 2, 3\}$ and $\{4, 5\}$ also fuse at a certain point. Which fusion will occur higher on the tree, or will they fuse at the same height, or is there not enough information to tell?

(b) At a certain point on the single linkage dendrogram, the clusters $\{5\}$ and $\{6\}$ fuse. On the complete linkage dendrogram, the clusters $\{5\}$ and $\{6\}$ also fuse at a certain point. Which fusion will occur higher on the tree, or will they fuse at the same height, or is there not enough information to tell?

Exercise 12.6.6 — PCA loadings via p separate least-squares regressions

Source: 12-unsupervised.md:1325-1327 (Conceptual 6)

We saw in Section 12.2.2 that the principal component loading and score vectors provide an approximation to a matrix $\mathbf{X}$, in the sense of (12.5). Specifically, the first $M$ principal component score and loading vectors solve the optimization problem given in (12.6).

Now, suppose that the first $M$ principal component score vectors $z_{im}$, $m = 1, \ldots, M$, are known. Using (12.6), explain that each of the first $M$ principal component loading vectors $\phi_{jm}$, $m = 1, \ldots, M$, can be obtained by performing $p$ separate least squares linear regressions. In each regression, the principal component score vectors are the predictors, and one of the features of the data matrix is the response.
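
A numerical illustration of the claim (my own code, simulated data): holding the score vectors fixed, regressing each feature on the scores with no intercept reproduces the loadings.

```python
# Hedged sketch: with the first M score vectors fixed, the loading for feature j
# equals the coefficient vector from regressing column j of X on the scores.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
X -= X.mean(axis=0)                       # column-center, as in PCA

U, s, Vt = np.linalg.svd(X, full_matrices=False)
M = 2
Z = U[:, :M] * s[:M]                      # first M principal component score vectors
Phi = Vt[:M].T                            # first M loading vectors, shape (p, M)

# p separate least squares regressions: column j of X on the score vectors
Phi_by_regression = np.column_stack(
    [np.linalg.lstsq(Z, X[:, j], rcond=None)[0] for j in range(X.shape[1])]
).T
assert np.allclose(Phi_by_regression, Phi)
```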


Module 11 — Neural networks (ISLP ch 10)

Exercise 10.10.1 — Two-hidden-layer NN: write f(X), count parameters

Source: 10-deeplearning.md:2379-2383 (Conceptual 1)

Consider a neural network with two hidden layers: $p = 4$ input units, 2 units in the first hidden layer, 3 units in the second hidden layer, and a single output.

(a) Draw a picture of the network, similar to Figures 10.1 or 10.4.

(b) Write out an expression for $f(X)$, assuming ReLU activation functions. Be as explicit as you can!

(c) Now plug in some values for the coefficients and write out the value of $f(X)$.

(d) How many parameters are there?
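
For part (b), the generic shape of the answer (in the chapter’s notation as best I recall it, with $g(z) = \max(0, z)$ for ReLU) is a composition of the two hidden layers:

$$A_k^{(1)} = g\Big( w_{k0}^{(1)} + \sum_{j=1}^{p} w_{kj}^{(1)} X_j \Big), \quad k = 1, 2; \qquad A_\ell^{(2)} = g\Big( w_{\ell 0}^{(2)} + \sum_{k=1}^{2} w_{\ell k}^{(2)} A_k^{(1)} \Big), \quad \ell = 1, 2, 3; \qquad f(X) = \beta_0 + \sum_{\ell=1}^{3} \beta_\ell A_\ell^{(2)}.$$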

Exercise 10.10.2 — Softmax invariance under additive shifts

Source: 10-deeplearning.md:2385-2389 (Conceptual 2)

Consider the softmax function in (10.13) (see also (4.13) on page 145) for modeling multinomial probabilities.

(a) In (10.13), show that if we add a constant $c$ to each of the $Z_\ell$, then the probability is unchanged.

(b) In (4.13), show that if we add constants $c_j$, $j = 0, 1, \ldots, p$, to each of the corresponding coefficients for each of the classes, then the predictions at any new point $x$ are unchanged.

This shows that the softmax function is over-parametrized. However, regularization and SGD typically constrain the solutions so that this is not a problem.

Scope note: softmax is the NN multi-class output activation. Useful invariance proof but multi-class logistic specifics are OUT of scope per prof.
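
For reference, the softmax form being analyzed is (modulo the exact indexing in (10.13))

$$\Pr(Y = m \mid X) = \frac{e^{Z_m}}{\sum_{\ell} e^{Z_\ell}},$$

so part (a) amounts to noting that replacing each $Z_\ell$ by $Z_\ell + c$ multiplies the numerator and the denominator by the same factor $e^{c}$.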

Exercise 10.10.3 — Multinomial log-likelihood reduces to binomial for M=2

Source: 10-deeplearning.md:2391 (Conceptual 3)

Show that the negative multinomial log-likelihood (10.14) is equivalent to the negative log of the likelihood expression (4.5) when there are $M = 2$ classes.

Scope note: same as above — useful for understanding the binary case from the multinomial form, but the multinomial framing itself is OUT.

ISLP 10.10.4 (CNN parameter counting + filter weight constraints) is omitted — prof Apr 28 said detailed CNN math is out of scope.