Problem 1
a)
import numpy as np
import pandas as pd
from matplotlib.pyplot import subplots
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor as VIF
from statsmodels.stats.anova import anova_lm
from ISLP import load_data
from ISLP.models import (ModelSpec as MS, summarize, poly)
Auto = load_data('Auto')
Auto = Auto.drop(columns=['name'])
# Auto['origin'] = Auto['origin'].astype('category')
Auto.describe()
Auto.info()
Auto['origin'] = Auto['origin'].astype('category')
import seaborn as sns
sns.pairplot(Auto, plot_kws={'s': 4}, height=1.2)
b)
Correlation matrix for the Auto data set where we omit the column origin:
Auto.drop(columns=['origin']).corr()
c)
terms = Auto.columns.drop('mpg')
X = MS(list(terms)).fit_transform(Auto)
y = Auto['mpg']
results = sm.OLS(y, X).fit()
summarize(results)
results.summary()
i. The F-statistic reported in the summary output shows that it would be extremely unlikely to see such a result if none of the variables had any explanatory power: the associated p-value is vanishingly small. So we can be confident that the set of predictors explains the response to a large degree. This is also confirmed by both a quite high $R^2$ and adjusted $R^2$ (both around 0.82).
ii. Yes, there is very strong evidence for a relationship between the weight of a car and the response; the p-value is extremely small. The interpretation is that if a car weighs 1000 lbs more (weight is recorded in pounds in this data set), we expect a decrease of about 6.71 miles per gallon (mpg), that is, the car can drive about 6.71 miles less per gallon of fuel.
iii. The estimated coefficient for year suggests that a newer car model has higher mpg compared to an older one. A model that is one year newer should thus, on average, achieve an increase in mpg equal to the estimated coefficient for year, i.e. it can drive that many miles further per gallon of fuel.
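For quick reference, these quantities can also be read directly off the fitted results object instead of results.summary(); a minimal sketch using standard statsmodels attributes:
# overall F-statistic and its p-value
print(results.fvalue, results.f_pvalue)
# R^2 and adjusted R^2
print(results.rsquared, results.rsquared_adj)
# estimated slopes discussed in ii. and iii.
print(results.params['weight'], results.params['year'])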
d)
Remember that if we want to test whether a categorical variable with more than two levels should be removed from the model, we actually have to test whether all slope coefficients that are associated with the respective binary dummy variables are zero simultaneously. Look again at equation (3.30) in the course book (p.85).
Consequently, we have to test here whether both dummy-variable coefficients for origin are zero at the same time, and for this we need the F-test. We can fit the reduced model (without origin) and compare it to the full model with anova_lm():
terms_no_origin = [t for t in terms if t != 'origin']
X_red = MS(terms_no_origin).fit_transform(Auto)
results_red = sm.OLS(y, X_red).fit()
anova_lm(results_red, results)
The p-value associated with adding origin to the model is very small; thus the origin of the car has an influence on the response mpg.
e)
fig, axes = subplots(2, 2, figsize=(9, 7))
# Residuals vs Fitted
axes[0, 0].scatter(results.fittedvalues, results.resid, s=8)
axes[0, 0].axhline(0, c='k', ls='--')
axes[0, 0].set_xlabel('Fitted value')
axes[0, 0].set_ylabel('Residual')
axes[0, 0].set_title('Residuals vs Fitted')
# Normal QQ
from scipy import stats
infl = results.get_influence()
std_resid = infl.resid_studentized_internal
stats.probplot(std_resid, dist='norm', plot=axes[0, 1])
axes[0, 1].set_title('Normal Q-Q')
# Scale-Location
axes[1, 0].scatter(results.fittedvalues, np.sqrt(np.abs(std_resid)), s=8)
axes[1, 0].set_xlabel('Fitted value')
axes[1, 0].set_ylabel(r'$\sqrt{|\mathrm{Standardized\ residuals}|}$')
axes[1, 0].set_title('Scale-Location')
# Residuals vs Leverage
axes[1, 1].scatter(infl.hat_matrix_diag, std_resid, s=8)
axes[1, 1].axhline(0, c='k', ls='--')
axes[1, 1].set_xlabel('Leverage')
axes[1, 1].set_ylabel('Standardized residual')
axes[1, 1].set_title('Residuals vs Leverage')
fig.tight_layout()
np.argmax(infl.hat_matrix_diag)
- In the residuals vs fitted plot (the so-called Tukey-Anscombe plot) there is evidence of non-linearity.
- The observation identified by np.argmax(infl.hat_matrix_diag) has an unusually high leverage. This does not necessarily have to be a problem, but it would be wise to double-check that this observation is not an outlier, as sketched below.
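A minimal sketch of such a check (reusing infl from the plotting code above): look up the externally studentized residual of the highest-leverage observation; a large absolute value (say above 3) would also flag it as an outlier.
idx = np.argmax(infl.hat_matrix_diag)
print(idx, infl.hat_matrix_diag[idx], infl.resid_studentized_external[idx])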
f)
from scipy import stats
rng = np.random.default_rng(2332)
n = 100
fig, axes = subplots(2, 3, figsize=(9, 6))
for ax in axes.ravel():
    sim = rng.standard_normal(n)
    stats.probplot(sim, dist='norm', plot=ax)
    ax.set_title('')
fig.tight_layout()
g)
X1 = MS(['displacement', 'weight', 'year', 'origin', ('year', 'origin')]).fit_transform(Auto)
results1 = sm.OLS(y, X1).fit()
summarize(results1)
Again, if we want to find evidence for whether the interaction term between year and origin is needed, we are actually testing whether two slope coefficients (one per interaction dummy) are zero at the same time. Consequently, we need the F-test, comparing the model with and without the interaction:
X1_main = MS(['displacement', 'weight', 'year', 'origin']).fit_transform(Auto)
results1_main = sm.OLS(y, X1_main).fit()
anova_lm(results1_main, results1)
There is very strong evidence that the year effect depends on the origin of the car, as can be seen from the F-test in the anova_lm() table (the p-value is very small). For European (2) and Japanese (3) cars, mpg appears to increase more steeply with year: the year slope is smallest for the reference category 1 (American) and larger for categories 2 (European) and 3 (Japanese), obtained by adding the respective interaction coefficients to the main year effect. This means that fuel consumption improves faster for cars produced outside America (note that mpg is “miles per gallon”, so a higher value means that the car consumes less fuel, as it can drive further per gallon).
Note: For a full understanding of interaction terms, you really do need both the summarize() and the anova_lm() tables.
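As a rough sketch of how to read off the per-origin year slopes from the coefficient table: the slope for the reference origin is the main year effect, and for the other origins it is the main effect plus the corresponding interaction coefficient. The exact column labels produced by ModelSpec may differ, so match them against results1.params.index rather than hard-coding names.
# collect all estimated coefficients whose name involves 'year'
year_terms = results1.params[[name for name in results1.params.index if 'year' in name]]
print(year_terms)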
h)
We try three different transformations and look at the residual plots:
# try three transformations: sqrt of the response and two of the predictor weight
Auto = Auto.assign(sqrtmpg=np.sqrt(Auto['mpg']),
                   logweight=np.log(Auto['weight']),
                   weight2=Auto['weight'] ** 2)
def resid_plot(res, title):
    fig, ax = subplots(figsize=(5, 4))
    ax.scatter(res.fittedvalues, res.resid, s=8)
    ax.axhline(0, c='k', ls='--')
    ax.set_xlabel('Fitted value'); ax.set_ylabel('Residual')
    ax.set_title(title)
    fig.tight_layout()
X3 = MS(['displacement', 'weight', 'year', 'origin']).fit_transform(Auto)
results3 = sm.OLS(Auto['sqrtmpg'], X3).fit()
resid_plot(results3, 'sqrt(mpg) ~ displacement + weight + year + origin')
X4 = MS(['displacement', 'logweight', 'year', 'origin']).fit_transform(Auto)
results4 = sm.OLS(Auto['mpg'], X4).fit()
resid_plot(results4, 'mpg ~ displacement + log(weight) + year + origin')
X5 = MS(['displacement', 'weight2', 'year', 'origin']).fit_transform(Auto)
results5 = sm.OLS(Auto['mpg'], X5).fit()
resid_plot(results5, 'mpg ~ displacement + weight^2 + year + origin')
Problem 2
a)
We need to assume that the error vector $\varepsilon$ (and hence $Y = X\beta + \varepsilon$) is multivariate normal. As $\hat{\beta} = (X^\top X)^{-1} X^\top Y$ is a linear transformation of the multivariate normal vector $Y$, $\hat{\beta}$ is also multivariate normal, with $\hat{\beta} \sim N\big(\beta,\ \sigma^2 (X^\top X)^{-1}\big)$.
All components of a multivariate normal vector are themselves univariate normal. This means that $\hat{\beta}_j$ is normally distributed with expected value $\beta_j$ and variance given by the $j$'th diagonal element of $\sigma^2 (X^\top X)^{-1}$.
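A small simulation sketch to check this empirically (the design and true values below are arbitrary choices for illustration): the sampling covariance of $\hat{\beta}$ over many simulated data sets should be close to $\sigma^2 (X^\top X)^{-1}$.
rng = np.random.default_rng(1)
n, sigma = 200, 1.0
X = np.column_stack([np.ones(n), rng.uniform(0, 1, n)])  # design with intercept
beta = np.array([1.0, 3.0])
betahats = np.array([np.linalg.lstsq(X, X @ beta + sigma * rng.standard_normal(n),
                                     rcond=None)[0] for _ in range(5000)])
print(np.cov(betahats.T))                 # empirical covariance of beta-hat
print(sigma**2 * np.linalg.inv(X.T @ X))  # theoretical covariance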
b)
Fix covariates X and repeat the following procedure many times (nsim = 1000):
- simulate $y = X\beta + \varepsilon$ with the true $\beta$ and $\sigma$, fit the model and collect $\hat{\beta}_j$ and $\widehat{\mathrm{SE}}(\hat{\beta}_j)$
- create the CI using $\hat{\beta}_j \pm t_{0.975,\,n-2}\,\widehat{\mathrm{SE}}(\hat{\beta}_j)$ and check whether it contains the true $\beta_j$
Approximately 95 % of the time the CI should contain the true $\beta_j$.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from ISLP.models import ModelSpec as MS
# CI for beta_j
beta0 = 1
beta1 = 3
true_beta = np.array([beta0, beta1]) # vector of model coefficients
true_sd = 1 # choosing true sd
nobs = 100
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=nobs)
df = pd.DataFrame({'x': X})
Xmat = MS(['x']).fit_transform(df) # design matrix
# Count how many times the true value is within the confidence interval
nsim = 1000
ci_int = np.zeros(nsim)
ci_x = np.zeros(nsim)
for i in range(nsim):
    y = rng.normal(loc=Xmat.values @ true_beta, scale=true_sd, size=nobs)
    mod = sm.OLS(y, Xmat).fit()
    ci = np.asarray(mod.conf_int())  # rows = (intercept, x), cols = (low, high)
    ci_int[i] = 1 if (true_beta[0] >= ci[0, 0] and true_beta[0] <= ci[0, 1]) else 0
    ci_x[i] = 1 if (true_beta[1] >= ci[1, 0] and true_beta[1] <= ci[1, 1]) else 0
np.array([ci_int.mean(), ci_x.mean()])
c)
We apply the same idea as in b), but now we fix X and a new point $x_0$, and
- simulate data, fit the model and collect $\hat{\beta}$ and $\hat{\sigma}$
- create the PI using $x_0^\top\hat{\beta} \pm t_{0.975,\,n-2}\,\hat{\sigma}\sqrt{1 + x_0^\top (X^\top X)^{-1} x_0}$
- simulate a new observation $Y_0 = x_0^\top \beta + \varepsilon_0$ and check whether the PI contains it
Approximately 95 % of the time the PI should contain $Y_0$.
# PI for Y_0
beta0 = 1
beta1 = 3
true_beta = np.array([beta0, beta1]) # vector of model coefficients
true_sd = 1 # choosing true sd
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=100)
df = pd.DataFrame({'x': X})
design = MS(['x']).fit(df)
Xmat = design.transform(df) # design matrix
x0 = np.array([1.0, 0.4])
new_df = pd.DataFrame({'x': [x0[1]]})
newX = design.transform(new_df)
# simulating and fitting models many times
nsim = 1000
pi_y0 = np.zeros(nsim)
for i in range(nsim):
    y = rng.normal(loc=Xmat.values @ true_beta, scale=true_sd, size=100)
    mod = sm.OLS(y, Xmat).fit()
    y0 = rng.normal(loc=x0 @ true_beta, scale=true_sd)
    pi = mod.get_prediction(newX).conf_int(obs=True, alpha=0.05)[0]
    pi_y0[i] = 1 if (y0 >= pi[0] and y0 <= pi[1]) else 0
pi_y0.mean()
d)
Confidence Interval for $x_0^\top \beta$
This corresponds to a CI for the expected response $E[Y_0] = x_0^\top \beta$ of an observation at $x_0$.
First, by using a) we have that
$$x_0^\top \hat{\beta} \sim N\!\big(x_0^\top \beta,\; \sigma^2\, x_0^\top (X^\top X)^{-1} x_0\big)$$
and
$$\frac{x_0^\top \hat{\beta} - x_0^\top \beta}{\sigma \sqrt{x_0^\top (X^\top X)^{-1} x_0}} \sim N(0, 1).$$
Consequently, the interval $x_0^\top \hat{\beta} \pm z_{1-\alpha/2}\, \sigma \sqrt{x_0^\top (X^\top X)^{-1} x_0}$ covers $x_0^\top \beta$ with probability $1 - \alpha$.
By substituting $\sigma$ with the estimator $\hat{\sigma}$, the standardized expression above follows a $t$ distribution with $n - 2$ degrees of freedom. Now we can construct a confidence interval with level $1 - \alpha$ (in our case $\alpha = 0.05$) as follows:
$$x_0^\top \hat{\beta} \pm t_{1-\alpha/2,\, n-2}\, \hat{\sigma} \sqrt{x_0^\top (X^\top X)^{-1} x_0}.$$
For $x_0^\top = (1, 0.4)$, this CI can be computed from the fitted model; see the sketch below.
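A sketch of computing this CI by hand for the $x_0$ used in c) (reusing Xmat, x0, newX and the last simulated y from that code), compared with the interval statsmodels reports for the mean response:
from scipy import stats as sps
mod = sm.OLS(y, Xmat).fit()                       # refit on the last simulated y
XtX_inv = np.linalg.inv(Xmat.values.T @ Xmat.values)
sigma_hat = np.sqrt(mod.scale)                    # residual standard error
tq = sps.t.ppf(0.975, df=mod.df_resid)
center = x0 @ mod.params.values
se_mean = sigma_hat * np.sqrt(x0 @ XtX_inv @ x0)
print(center - tq * se_mean, center + tq * se_mean)
print(mod.get_prediction(newX).conf_int(obs=False, alpha=0.05))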
Prediction interval
To construct the PI we can look at the distribution of the prediction error $Y_0 - \hat{Y}_0$, where $Y_0 = x_0^\top \beta + \varepsilon_0$ and $\hat{Y}_0 = x_0^\top \hat{\beta}$.
We have that
$$E[Y_0 - \hat{Y}_0] = 0,$$
and by assuming that $\varepsilon_0$ and $\hat{\beta}$ are independent we have that
$$\mathrm{Var}(Y_0 - \hat{Y}_0) = \sigma^2 + \sigma^2\, x_0^\top (X^\top X)^{-1} x_0 = \sigma^2\big(1 + x_0^\top (X^\top X)^{-1} x_0\big)$$
and
$$Y_0 - \hat{Y}_0 \sim N\!\big(0,\; \sigma^2 (1 + x_0^\top (X^\top X)^{-1} x_0)\big).$$
Following the same logic as for the CI before, and by substituting the estimate of the variance, we can construct a prediction interval with level $1 - \alpha$ as follows:
$$x_0^\top \hat{\beta} \pm t_{1-\alpha/2,\, n-2}\, \hat{\sigma} \sqrt{1 + x_0^\top (X^\top X)^{-1} x_0}.$$
For $x_0^\top = (1, 0.4)$, this PI can again be computed from the fitted model; see the sketch below.
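The corresponding sketch for the PI (reusing the quantities from the CI sketch above); it should agree with the statsmodels prediction interval obtained with obs=True:
se_pred = sigma_hat * np.sqrt(1 + x0 @ XtX_inv @ x0)
print(center - tq * se_pred, center + tq * se_pred)
print(mod.get_prediction(newX).conf_int(obs=True, alpha=0.05))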
Observe now that the two intervals are similar, but the prediction interval is by construction wider. Specifically, it can be significantly wider in applications where the error variance is large.
The connection between the CI for $\beta_j$, the CI for $x_0^\top \beta$, and the PI for $Y_0$ at $x_0$: the first is a CI for a parameter, the second is a CI for the expected regression line at the point $x_0$ (when you only have one covariate, this may be more intuitive), and the last is a PI for the response $Y_0$. The difference between the two latter is that $Y_0$ is an observation, whereas $x_0^\top \beta = E[Y_0]$ is the expected value of the observation and hence a function of the model parameters (NOT an observation).
e)
We have a model of the form $Y = X\beta + \varepsilon$, where $\varepsilon$ is the error. The error of the model is unknown and unobserved, but we can estimate it by what we call the residuals. The residuals are given by the difference between the true response and the predicted value, $r = Y - \hat{Y} = (I - H)Y$, where $H = X(X^\top X)^{-1}X^\top$ is called the “hat matrix”.
Properties of the raw residuals: they are normally distributed with mean 0 and covariance $\sigma^2 (I - H)$. This means that the residuals may have different variances (depending on the leverages $h_{ii}$) and may also be correlated.
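A minimal sketch illustrating these properties on the simulated design from b)/c) above (reusing Xmat and the last simulated y): the raw residuals equal $(I - H)Y$, and the leverages $h_{ii}$ on the diagonal of $H$ differ between observations.
Xv = Xmat.values
H = Xv @ np.linalg.inv(Xv.T @ Xv) @ Xv.T                 # hat matrix
res = sm.OLS(y, Xmat).fit()
print(np.allclose(res.resid, (np.eye(len(y)) - H) @ y))  # residuals = (I - H) y
print(np.diag(H).min(), np.diag(H).max())                # leverages h_ii differ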
In a model check, we want to verify that our errors are independent, homoscedastic (same variance for each observation) and not dependent on the covariates. As we don’t know the true errors, we use the residuals as stand-ins for them, but as mentioned, the residuals may have different variances and may be correlated. This is why we don’t want to use the raw residuals for model checking.
To remedy this, we rescale the residuals so that they at least have equal variances, dividing each raw residual by $\hat{\sigma}\sqrt{1 - h_{ii}}$. We do that by working with standardized or studentized residuals.
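A sketch of the standardization (same objects as in the previous sketch): dividing each raw residual by $\hat{\sigma}\sqrt{1 - h_{ii}}$ reproduces the internally studentized residuals from statsmodels.
infl2 = res.get_influence()
h = infl2.hat_matrix_diag
std_by_hand = res.resid / np.sqrt(res.scale * (1 - h))
print(np.allclose(std_by_hand, infl2.resid_studentized_internal))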