We strongly recommend working through Section 3.6 in the course book (the lab on linear regression).
Problem 1 (Extension from Book Ex. 9)
This question involves the use of multiple linear regression on the Auto data set from the ISLP package (you may use Auto? to see a description of the data). First we exclude the variable name from our analysis and look at the summary and structure of the dataset.
import numpy as np
import pandas as pd
from matplotlib.pyplot import subplots
import statsmodels.api as sm
from ISLP import load_data
from ISLP.models import (ModelSpec as MS, summarize, poly)
Auto = load_data('Auto')
Auto = Auto.drop(columns=['name'])
# Auto['origin'] = Auto['origin'].astype('category')
Auto.describe()
Auto.info()
We obtain a summary and see that all variables are numerical (continuous). However, when we check the description of the data (again with Auto?), we immediately see that origin actually encodes whether a car is of American (origin = 1), European (origin = 2) or Japanese (origin = 3) origin, so the values 1, 2 and 3 do not have any actual numerical meaning. We therefore need to change the data type of that variable first, to let Python know that we are dealing with a qualitative (categorical) variable instead of a continuous one (otherwise we will obtain wrong model fits). In pandas such variables are called categorical (a synonym for “qualitative predictor”), and before we continue with any analyses we first need to convert origin into a categorical variable:
Auto['origin'] = Auto['origin'].astype('category')
a)
Use seaborn’s pairplot() (or pandas.plotting.scatter_matrix) to produce a scatterplot matrix which includes all of the variables in the data set.
b)
Compute the correlation matrix between the variables. You will need to remove the categorical covariate origin, because this is no longer a continuous variable.
c)
Use MS() together with sm.OLS() to perform a multiple linear regression with mpg (miles per gallon, a measure for fuel consumption) as the response and all other variables (except name) as the predictors. Use summarize() (or results.summary()) to print the results. Comment on the output. In particular:
i. Is there a relationship between the predictors and the response?
ii. Is there evidence that the weight of a car influences mpg? Interpret the regression coefficient (what happens if a car weighs 1000 kg more, for example?).
iii. What does the estimated coefficient for the year variable suggest?
d)
Look again at the regression output from question c). Now we want to test whether the origin variable is important. How does this work for a categorical variable with more than only two levels?
e)
Produce diagnostic plots of the linear regression fit (residuals vs. fitted, QQ plot of standardized residuals, scale-location, residuals vs. leverage). Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?
f)
For beginners, it can be difficult to decide whether a certain QQ plot looks “good” or “bad”, because we only look at it and do not formally test anything. To get a feeling for how “bad” a QQ plot may look even when the normality assumption is perfectly OK, we can use simulation: we simply draw from the normal distribution and produce the QQ plot. Use the following code to repeat this six times:
from scipy import stats
rng = np.random.default_rng(2332)
n = 100
fig, axes = subplots(2, 3, figsize=(9, 6))
for ax in axes.ravel():
    sim = rng.standard_normal(n)
    stats.probplot(sim, dist='norm', plot=ax)
    ax.set_title('')
fig.tight_layout()
g)
Let us look at interactions. These can be included in the model matrix by passing tuples to MS() (see Section 3.6.4 in the course book).
Fit another model for mpg, including only displacement, weight, year and origin as predictors, plus an interaction between year and origin. The interaction year*origin (in the R formula sense) corresponds in MS() to including the main effects together with the interaction tuple, i.e. MS(['displacement', 'weight', 'year', 'origin', ('year', 'origin')]). Is there evidence that the interaction term is relevant? Give an interpretation of the result.
h)
Try a few different transformations of the variables, such as log(X), √X or X². See Section 3.6.5 in the course book for how to do this. Perhaps you manage to improve the residual plots that you obtained in e)? Comment on your findings.
Problem 2
a)
A core finding for the least-squares estimator β̂ = (XᵀX)⁻¹XᵀY of linear regression models is β̂ ~ N(β, σ²(XᵀX)⁻¹), with X the design matrix and σ² the error variance.
- Show that β̂ has this distribution with the given mean and covariance matrix.
- What do you need to assume to get to this result?
- What does this imply for the distribution of the j-th element β̂ⱼ of β̂?
- In particular, how can we calculate the variance of β̂ⱼ?
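The stated mean and covariance can be checked empirically. A simulation sketch (with arbitrarily chosen β, σ and design) that compares the sampling distribution of β̂ against σ²(XᵀX)⁻¹:

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma = 200, 1.0
beta = np.array([1.0, 2.0])

# Fixed design matrix with an intercept column.
X = np.column_stack([np.ones(n), rng.uniform(0, 1, n)])
theory_cov = sigma**2 * np.linalg.inv(X.T @ X)

# Repeatedly simulate y and record the least-squares estimate.
est = np.empty((5000, 2))
for i in range(5000):
    y = X @ beta + rng.normal(scale=sigma, size=n)
    est[i] = np.linalg.lstsq(X, y, rcond=None)[0]

print(est.mean(axis=0))  # close to beta (unbiasedness)
print(np.cov(est.T))     # close to sigma^2 (X^T X)^{-1}
print(theory_cov)
```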
b)
What is the interpretation of a 95% confidence interval? Hint: repeat the experiment (simulating new responses y) many times; on average, how many of the CIs cover the true β? The following code shows an interpretation of a confidence interval. Study it and fill in the code where needed.
- Model: Y = β₀ + β₁x + ε, with ε ~ N(0, σ²).
beta0 = ...
beta1 = ...
true_beta = np.array([beta0, beta1]) # vector of model coefficients
true_sd = 1 # choosing true sd
nobs = 100
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=nobs) # simulate the predictor variable X
df = pd.DataFrame({'x': X})
Xmat = MS(['x']).fit_transform(df) # design matrix
# Count how many times the true value is within the confidence interval
nsim = 1000
ci_int = np.zeros(nsim)
ci_x = np.zeros(nsim)
for i in range(nsim):
    y = rng.normal(loc=Xmat.values @ true_beta, scale=true_sd, size=nobs)
    mod = sm.OLS(y, Xmat).fit()
    ci = mod.conf_int()  # shape (2, 2): rows = parameters, columns = (lower, upper)
    # 1 if the true value of beta0 is within the CI, else 0
    ci_int[i] = 1 if ... else 0
    # 1 if the true value of beta1 is within the CI, else 0
    ci_x[i] = 1 if ... else 0
np.array([ci_int.mean(), ci_x.mean()])
c)
What is the interpretation of a 95% prediction interval? Hint: repeat the experiment (simulating new responses y) for a given predictor value x₀. Write Python code that shows the interpretation of a 95% PI. Hint: in order to produce the PIs, use a fixed data point x₀; furthermore, you may use a similar code structure as in b).
d)
Construct a 95% CI for β₁. Explain the connections between a CI for β₁, a CI for the mean response E[Y | x₀] = x₀ᵀβ, and a PI for Y at x₀.
e)
Explain the difference between an error and a residual. What are the properties of the raw residuals? Why don’t we want to use the raw residuals for model checking? What is our solution to this?