Multiple- and single-choice questions lifted verbatim from Compulsory Exercise 3 (TMA4268, V2020).
R-coding parts are excluded. Click an option to lock; explanations open automatically.
Score tracker bottom-left. Solutions adapted from the official solution PDF.
Question 1 (2 points, CE3 P2b)
Answer the following multiple choice questions by using the Covid-19 data to model the probability of being deceased (deceased = 1) as a function of sex, age and country (with France as reference level; no interactions). The fitted glm() output is shown below.
Which of the following statements are true, which false?
(i) Country is not a relevant variable in the model.
(ii) The slope for indonesia has a large $p$-value, which shows that we should remove the Indonesian population from the model, as they do not fit the model as well as the Japanese population.
(iii) Increasing the age by 10 years, $x^*_{age} = x_{age} + 10$, and holding all other covariates constant, the odds ratio to die increases by a factor of 1.97.
(iv) The probability to die is approximately 3.12 times larger for males than for females.
Show answer
Solution: FALSE — FALSE — FALSE — FALSE.
False — an analysis of deviance test on the full model gives $p \approx 0.0002$ for country, so country is a relevant variable. A non-significant single dummy (Indonesia vs France) doesn't mean the whole factor is irrelevant.
False — a large $p$-value for one dummy level only means we have no evidence that that level differs from the reference; it is not a justification to drop a subpopulation from a model. You don't remove observations because their group is "not different from reference".
False — the calculation $\exp(10 \cdot \hat\beta_{age}) = \exp(10 \cdot 0.068) \approx 1.97$ is correct, but this is the multiplicative change in the odds, not in the odds ratio.
False — $\exp(\hat\beta_{sex}) = \exp(1.137) \approx 3.12$ is the odds ratio (males vs females), not the probability ratio.
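The arithmetic behind (iii) and (iv) is quick to verify; a small Python check, using the coefficient values 0.068 and 1.137 quoted above:

```python
import math

# Coefficients quoted in the solution above (from the fitted glm() output)
beta_age = 0.068
beta_sex = 1.137

# (iii): adding 10 years to age multiplies the ODDS by exp(10 * beta_age)
odds_factor_10y = math.exp(10 * beta_age)

# (iv): exp(beta_sex) is the odds ratio males vs females, not a probability ratio
odds_ratio_sex = math.exp(beta_sex)

print(round(odds_factor_10y, 2), round(odds_ratio_sex, 2))  # 1.97 3.12
```

Both numbers match the statements; what is wrong in (iii) and (iv) is the terminology ("odds ratio" vs "odds", "probability" vs "odds ratio"), not the computation.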
Consider the classification tree below and the following LDA code and output. Which of the following statements are true, which false?
library(MASS)
table(predict = predict(lda(deceased ~ age + sex + country, data = d.corona))$class,
      true = d.corona$deceased)
## true
## predict 0 1
## 0 1926 31
## 1 39 14
(i) The probability of dying (deceased = 1) is about 0.46 for a French person with age above 91.
(ii) Age seems to be a more important predictor for mortality than sex.
(iii) The "null rate" for misclassification is 2.24%, because this is the proportion of deaths among all cases in the dataset. No classifier should have a higher misclassification rate.
(iv) LDA is not a very useful method for this dataset, among other reasons because it does not estimate probabilities, but also because the misclassification error is too high.
Show answer
Solution: TRUE — TRUE — TRUE — FALSE.
Note from the official solution: statements (iii) and (iv) were later found to be ambiguous, so both True and False were graded as correct. Below is the most defensible reading.
True — follow the tree: age < 79.5 is False (age > 91), then country: indonesia,japan,Korea is False (French), then age < 91 is False. The terminal leaf reads 0.461500 ≈ 0.46.
True — age appears at the root and again deep in the tree, while sex appears only once (and as a tie-breaker in a single subtree). The classifier leans heavily on age.
True (most-defensible reading) — the null classifier "always predict 0 (alive)" misclassifies exactly the 45/2010 ≈ 2.24% who actually died. A useful classifier should beat this; LDA here gets $(39+31)/2010 \approx 3.48\%$, which is worse than the null. Officially graded ambiguous.
False — LDA does estimate (posterior) probabilities via Bayes' rule; that part of the statement is wrong. The misclassification claim is defensible (3.48% > 2.24% null), but the reason about probabilities is incorrect. Officially graded ambiguous.
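The two error rates in (iii) and (iv) follow directly from the confusion matrix printed above; a minimal Python sketch:

```python
# Confusion matrix from the LDA output above (rows = predicted, cols = true)
tn, fn = 1926, 31   # predicted 0 (alive)
fp, tp = 39, 14     # predicted 1 (deceased)
n = tn + fn + fp + tp            # 2010 cases in total

null_rate = (fn + tp) / n        # "always predict 0" misclassifies the 45 deaths
lda_rate = (fn + fp) / n         # LDA errors: 31 false negatives + 39 false positives

print(round(100 * null_rate, 2), round(100 * lda_rate, 2))  # 2.24 3.48
```

So the LDA classifier here does indeed perform worse than the trivial null classifier, which is the defensible core of statement (iii).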
Inference vs prediction: Which of the following methods are suitable when the aim of your analysis is inference?
(i) Lasso and ridge regression
(ii) Multiple linear regression with interaction terms
(iii) Logistic regression
(iv) Support Vector Machines
Show answer
Solution: TRUE — TRUE — TRUE — FALSE.
True — lasso and ridge yield interpretable coefficients on the original predictors; lasso additionally selects variables. Both are routinely used for inference, though SE/$p$-values require post-selection care.
True — coefficients of an MLR (including interaction terms) have direct interpretations and standard inferential tooling ($t$-tests, CIs, $F$-tests).
True — coefficients are log-odds with interpretable signs/magnitudes and standard Wald-type inference.
False — SVM is a black-box geometric classifier; it does not produce interpretable parameters and is not used for inference about effects.
We again look at the Covid-19 dataset from Problem 2 to study some properties of the bootstrap method. Below, the standard errors of the regression coefficients in the logistic regression model with sex, age and country as predictors were estimated using 1000 bootstrap iterations (column std.error). These standard errors can be compared to those obtained by fitting a single logistic regression model with glm(). Look at the R output below and compare the standard errors from the two approaches (note that the t1* to t6* variables are sorted in the same way as in the glm() output).
(i) There are large differences between the estimated standard errors, which indicates a problem with the bootstrap.
(ii) The differences between the estimated standard errors indicate a problem with the assumptions taken about the distribution of the estimated parameters in logistic regression.
(iii) The glm function leads to too small $p$-values for the differences between countries, in particular for the differences between Indonesia and France and between Japan and France.
(iv) The bootstrap relies on random sampling the same data without replacement.
Show answer
Solution: FALSE — TRUE — TRUE — FALSE.
False — bootstrap and glm SEs can differ; the bootstrap is treated as the more honest estimator (fewer parametric assumptions). A large gap is a signal about the parametric assumptions, not a problem with the bootstrap itself.
True — the glm SEs come from the asymptotic Wald approximation under specific distributional assumptions; when bootstrap SEs differ markedly (especially for countryindonesia and countryjapan, where bootstrap SE is many times larger), the Wald assumptions are suspect for those coefficients (small subgroup sizes, separation, etc.).
True — since the glm SEs for the Indonesia and Japan dummies are far smaller than the bootstrap SEs, the Wald $z = \hat\beta / \widehat{\text{SE}}$ is inflated and the resulting $p$-values are too small.
False — the bootstrap samples with replacement. "Without replacement" of the full $n$ would just return the original dataset.
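The with-replacement resampling in (iv) is the defining step of the nonparametric bootstrap. A minimal sketch (Python, with a hypothetical toy sample standing in for the Covid-19 data; the point is only the resampling scheme):

```python
import random
import statistics

random.seed(1)

# Hypothetical data; in the exercise this would be the Covid-19 observations
data = [random.gauss(0, 1) for _ in range(200)]

def bootstrap_se(data, stat, b=1000):
    """Nonparametric bootstrap SE: draw n points WITH replacement, b times."""
    n = len(data)
    reps = [stat([random.choice(data) for _ in range(n)]) for _ in range(b)]
    return statistics.stdev(reps)

se_boot = bootstrap_se(data, statistics.mean)
se_formula = statistics.stdev(data) / len(data) ** 0.5

# For the sample mean the two agree closely; for glm coefficients with tiny
# subgroups (indonesia, japan) they can diverge badly, as in the output above.
print(round(se_boot, 3), round(se_formula, 3))
```

Sampling the full n without replacement would return the original dataset every time, so every bootstrap replicate would be identical and the estimated SE would be zero.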
Which of the following count as regularization techniques?
(i) Forward/backward selection
(ii) Stochastic gradient descent (SGD)
Show answer
Solution: FALSE — TRUE.
False — forward/backward selection is a subset-selection (variable-selection) technique, not a continuous shrinkage / regularization method.
True — SGD has an implicit regularization effect (especially with early stopping); in the neural-network context it is counted among the regularization techniques the course covers.
In ridge regression, we estimate the regression coefficients in a linear regression model by minimizing
$$\sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2 + \lambda\sum_{j=1}^{p}\beta_j^2.$$
What happens when we increase $\lambda$ from 0? Choose the single correct statement:
A The training RSS will steadily decrease.
B The test RSS will steadily decrease.
C The test RSS will steadily increase.
D The bias will steadily increase.
E The variance of the estimator will steadily increase.
Show answer
Correct answer: D
D — at $\lambda = 0$ ridge is OLS (unbiased). As $\lambda$ grows the coefficients are shrunk toward zero, so bias grows monotonically; in the limit $\lambda \to \infty$ all $\hat\beta_j \to 0$ and bias is maximal.
A — training RSS increases with $\lambda$ (we're moving away from the OLS minimum of training RSS).
B and C — test RSS is U-shaped in $\lambda$: it typically decreases first, hits a minimum, then increases. Neither "steadily decrease" nor "steadily increase" is correct.
E — variance moves the opposite way: shrinkage reduces variance as $\lambda$ grows. That's exactly the bias–variance trade-off ridge exploits.
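The monotone behaviour of the coefficients and the training RSS can be seen in the simplest case. A one-predictor, no-intercept sketch (Python, simulated data) using the closed-form ridge solution beta_hat(lambda) = sum(x*y) / (sum(x^2) + lambda):

```python
import random

random.seed(2)

# Simulated data from y = 2x + noise
beta_true = 2.0
x = [random.uniform(-1, 1) for _ in range(100)]
y = [beta_true * xi + random.gauss(0, 0.5) for xi in x]

sxy = sum(a * b for a, b in zip(x, y))
sxx = sum(a * a for a in x)

def ridge_beta(lam):
    # Closed-form one-predictor ridge estimate (no intercept)
    return sxy / (sxx + lam)

def train_rss(beta):
    return sum((yi - beta * xi) ** 2 for xi, yi in zip(x, y))

lams = [0, 1, 10, 100]
betas = [ridge_beta(l) for l in lams]
rss = [train_rss(b) for b in betas]

# Coefficient shrinks monotonically toward 0 (bias grows steadily: answer D),
# while training RSS rises monotonically (so answer A is wrong).
assert all(betas[i] > betas[i + 1] > 0 for i in range(len(betas) - 1))
assert all(rss[i] < rss[i + 1] for i in range(len(rss) - 1))
print([round(b, 3) for b in betas])
```

At lambda = 0 the estimate is exactly OLS (the training-RSS minimizer); any lambda > 0 moves the estimate away from that minimum, which is why training RSS can only increase.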
Which statement about the curse of dimensionality is correct?
A It means that we have a bias-variance tradeoff in $K$-nearest neighbor regression, where large $K$ leads to more bias but less variance for the predictor function.
B It means that the performance of the $K$-nearest neighbor classifier gets worse when the number of predictor variables $p$ is large.
C It means that the $K$-means clustering algorithm performs badly if the datapoints lie in a high-dimensional space.
D It means that support vector machines with radial kernel function should be avoided, because radial kernels correspond to infinite-dimensional polynomial boundaries.
E It means that we should never measure too many covariates when we want to do classification.
Show answer
Correct answer: B
B — in high dimensions, "nearest" neighbours are no longer near in any meaningful sense (points become roughly equidistant). KNN, which depends on local structure, degrades as $p$ grows.
A describes the bias–variance trade-off of KNN itself, not the curse of dimensionality.
C — the curse is a general high-dimensional phenomenon; it is not a property of $K$-means specifically.
D — wrong characterisation of radial kernels; radial-kernel SVMs are not generally avoided in high dimensions, and the reasoning given is not the curse.
E — a normative oversimplification: many covariates can be fine if local structure isn't required (linear models, trees with regularisation, etc.).
Now assume you have 10 covariates, $X_1$ to $X_{10}$, each of them uniformly distributed in the interval $[0, 1]$. To predict a new test observation $(X_1^{(0)}, \dots, X_{10}^{(0)})$ with a $K$-nearest neighbor (KNN) approach, we use all observations within 20% of the range closest to each of the covariates (that is, in each dimension). Which proportion of available (training) observations can you expect to use for prediction?
A $1.02 \cdot 10^{-7}$
B $2.0 \cdot 10^{-3}$
C $0.20$
D $0.04$
E $10^{-10}$
Show answer
Correct answer: A
A — each covariate contributes an independent factor of $0.2$, so the hypercube fraction is $0.2^{10} = 1.024 \cdot 10^{-7}$. This is the canonical illustration of the curse of dimensionality.
B — $0.2^4 \approx 1.6\cdot 10^{-3}$; forgets six of the ten dimensions.
C — the per-dimension fraction (0.20). Ignores the product across dimensions entirely.
D — $0.2^2 = 0.04$; treats only two dimensions.
E — $0.1^{10}$; wrong base (uses 10% rather than 20%).
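The correct answer and all the distractors are one-line computations (Python):

```python
# Expected fraction of training points in the 20%-per-dimension hyperbox:
# each uniform covariate contributes an independent factor of 0.2.
p = 10
frac = 0.2 ** p
print(frac)        # ≈ 1.024e-07  (option A)

# The distractors arise from miscounting dimensions or the per-dimension rate:
print(0.2 ** 4)    # ≈ 0.0016    (option B: only 4 dimensions)
print(0.2 ** 2)    # ≈ 0.04      (option D: only 2 dimensions)
print(0.1 ** 10)   # ≈ 1e-10     (option E: 10% instead of 20%)
```

With, say, 100 000 training points, the expected number of usable neighbours is 100 000 × 1.024e-07 ≈ 0.01, i.e. essentially none: the curse of dimensionality in one number.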
This example is taken from a real clinical study by Ikeda, Matsunaga, Irabu, et al. Using vital signs to diagnose impaired consciousness: cross sectional observational study. BMJ 2002;325:800. Researchers investigated the use of vital signs as a screening test to identify brain lesions in patients with impaired consciousness. The setting was an emergency department in Japan. The study included 529 consecutive patients who arrived with impaired consciousness, and patients were followed until discharge. The vital signs of systolic and diastolic blood pressure and pulse rate were recorded on arrival. The aim of the study was to find a quick test for assessing whether a newly arrived patient suffered from a brain lesion: while vital signs can be measured immediately, the actual diagnosis of a brain lesion can only be made later, on the basis of brain imaging and neurological examination, so the immediately available blood-pressure and heart-rate measurements matter for a rapid assessment. In total, 312 patients (59%) were diagnosed with a brain lesion.
The performance of each vital sign (systolic blood pressure, diastolic blood pressure and heart rate) was separately evaluated as a screening test to quickly diagnose brain lesions. To assess the quality of each of these vital signs, different thresholds were taken successively to discriminate between "negative" and "positive" screening test result. For each vital sign and each threshold the sensitivity and specificity were derived and used to plot a receiver operating characteristic (ROC) curve for the vital sign (Figure 1):
Figure 1: Figure for problem 5f); taken from P. Sedgwick, BMJ 2011;343.
Which of the following statements are true?
(i) The value of 1-specificity represents the proportion of patients without a diagnosed brain lesion identified as positive on screening.
(ii) When we use different cut-offs, sensitivity increases at the cost of lower specificity, and vice versa.
(iii) A perfect diagnostic test has an AUC of 0.5.
(iv) The vital sign that is most suitable to distinguish between patients with and without brain lesion is systolic blood pressure.
Show answer
Solution: TRUE — TRUE — FALSE — TRUE.
True — specificity = $P(\text{neg test} \mid \text{no lesion})$, so $1 - \text{specificity}$ is the false-positive rate, i.e. the proportion of patients without a lesion that the test calls positive.
True — moving the threshold along the ROC curve trades sensitivity for specificity. That trade-off is exactly what the curve traces out.
False — a perfect test has AUC = 1 (the curve hugs the top-left corner). AUC = 0.5 corresponds to a useless test (the diagonal).
True — the systolic-blood-pressure curve sits highest and farthest from the diagonal in the figure, so it has the largest AUC and is the best discriminator.
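The sensitivity/specificity trade-off in (ii) can be made concrete on a toy example (Python; the scores and labels here are hypothetical, not the BMJ data):

```python
# Hypothetical screening scores (think: rescaled systolic blood pressure)
scores = [0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]
lesion = [0,   0,   1,    0,   1,   0,   1,   1]   # true lesion status

def sens_spec(threshold):
    """Call the test positive when score >= threshold."""
    tp = sum(s >= threshold and y for s, y in zip(scores, lesion))
    fn = sum(s < threshold and y for s, y in zip(scores, lesion))
    tn = sum(s < threshold and not y for s, y in zip(scores, lesion))
    fp = sum(s >= threshold and not y for s, y in zip(scores, lesion))
    return tp / (tp + fn), tn / (tn + fp)

for t in (0.2, 0.5, 0.85):
    se, sp = sens_spec(t)
    print(t, round(se, 2), round(sp, 2))
# Lowering the threshold raises sensitivity but lowers specificity;
# 1 - specificity (the false-positive rate) is the x-axis of the ROC curve.
```

Sweeping the threshold over all values and plotting sensitivity against 1 - specificity traces out exactly the ROC curves in Figure 1.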
We study the decathlon2 dataset from the factoextra package in R, where Athletes' performance during a sporting meeting was recorded. We look at 23 athletes and the results from the 10 disciplines in two competitions. Some rows of the dataset are displayed here:
From a principal component analysis we obtain the biplot given in Figure 2.
Figure 2: Figure for question 5g).
Which of the following statements are true, which false?
(i) The athlete named CLAY seems to be one of the fastest 1500m runners.
(ii) Athletes that are good in 100m tend to be also good in long jump.
(iii) The first principal component has the highest loadings for 100m and long jump.
(iv) 110m hurdle has a very small loading for PC2.
Show answer
Solution: FALSE — TRUE — TRUE — TRUE.
False — CLAY sits at a negative PC2 value, and 1500m's PC2 loading is also negative. Two negatives align, so CLAY's projected 1500m time is large — i.e. CLAY runs slow, not fast.
True — 100m and long_jump arrows lie along the same PC1 axis (in opposite directions, because low 100m times go with high long-jump distances). Athletes scoring high on this axis tend to excel at both.
True — in absolute value, the PC1 loadings for 100m and long_jump are the largest in the biplot.
True — the 110.hurdle arrow is almost parallel to PC1, so its PC2 component is close to zero.
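The sign logic used to read the biplot in (i) can be sketched numerically (Python; the loadings and scores below are made up for illustration, not the decathlon2 values):

```python
# Hypothetical PC loadings for 1500m and a hypothetical athlete's PC scores
loading_1500m = (-0.1, -0.7)   # small PC1 loading, negative PC2 loading
athlete_score = (1.5, -2.0)    # athlete with a clearly negative PC2 score

# Centered, rank-2 reconstruction of the 1500m value: score . loading
approx_1500m = sum(s * l for s, l in zip(athlete_score, loading_1500m))
print(round(approx_1500m, 2))  # 1.25 > 0: above-average 1500m TIME, i.e. slow
```

Two negative signs multiply to a positive contribution, so a negative-PC2 athlete projects onto an above-average 1500m time when the 1500m PC2 loading is also negative; that is exactly the argument for why statement (i) is false.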