Big Data Analytics - Coursework 1
Student name:     Hammond Mason
Student number:
Programme:          M.Sc.(Data Science)
1. Statistical Learning Methods

For each of parts (a) through (d) indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than an inflexible method.  Justify your answer.


(a) The sample size \(n\) is extremely large and the number of predictors is small.

The more ‘flexible’ the method, the more closely it can follow the sample data (for example, by including more predictors or higher-order terms).  This can lead to ‘overfitting’, where the additional flexibility produces a better fit to the training data but, in doing so, captures some of the randomness (‘noise’) inherent in the sample.  This can result in either or both of (a) a deterioration in the predictive power of the model, and (b) the inclusion of effects whose influence on the response is difficult to rationalise.

When the sample size is extremely large and the number of predictors is small, there are many observations available to estimate each effect, so the risk that a flexible method will overfit is greatly reduced and its extra flexibility can be used to capture the true relationship.  Consequently, the performance of a flexible method would generally be better than that of an inflexible method in such a case.
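
As a minimal sketch of this point (using hypothetical simulated data, not the coursework data), we can compare an inflexible linear fit with a flexible smoothing spline when \(n\) is very large and there is a single predictor:

# Sketch: large n, one predictor, a non-linear true f(X) (all values hypothetical)
set.seed(1)
n = 10000                                   # extremely large sample size
x = runif(n, 0, 10)
y = sin(x) + 0.1 * x^2 + rnorm(n)

train      = sample(n, n / 2)
inflexible = lm(y ~ x, subset = train)      # restricted, linear fit
flexible   = smooth.spline(x[train], y[train])

# Test MSE of each method: with so much data the flexible fit should track
# f(X) closely without overfitting, giving the lower test error
mean((y[-train] - predict(inflexible, data.frame(x = x[-train])))^2)
mean((y[-train] - predict(flexible, x[-train])$y)^2)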


(b) The number of predictors \(p\) is extremely large and the number of observations is small.

With a small number of observations and a large number of predictors available to choose from, it is very easy to overfit a flexible model - we just keep adding predictors until our model fits the data as best it can.  Consequently, in such a case, an inflexible model will generally provide better performance than a flexible model.


(c) The relationship between the predictors and response is highly non-linear.

When there is a high degree of non-linearity, an inflexible method with a restricted functional form is unlikely to capture the full relationship, so a flexible method would generally outperform it.


(d) The variance of the error terms, ie. \(\sigma^2 = Var(\epsilon)\), is extremely high.

A basic tenet of statistical learning is that an observed random variable, \(Y\), is influenced by one or more other random variables, \(X\), in a way that can be described by the mathematical equation:


\(Y = f(X) + \epsilon\)


where \(f(X)\) is the function describing the process by which \(Y\) is influenced by \(X\) and \(\epsilon\) represents all other influences on \(Y\).  It is further assumed that \(\epsilon\) has constant variance, an expected value of zero and is unrelated to the \(X\) variables, ie. \(\epsilon\) is “white noise”.

The statistical learning process consists of identifying a model, \(\hat{f}(X)\), that best estimates the true process, \(f(X)\).  As a result of this estimation process, we have the mathematical relationship:


\(Y = \hat{f}(X) + u\)


where \(u\) are the ‘residual’ terms, being the difference between the (observed) \(Y\) values and the model’s estimates computed from the (observed) \(X\) values, ie:


\(u = Y - \hat{f}(X)\)


It is important to understand \(u \neq \epsilon\).  The \(u\) are directly influenced by our choice of model, \(\hat{f}(X)\), as the above equation shows, but \(\epsilon\) is unrelated to our choice of model.  Regardless of how well our statistical model approximates \(f(X)\), it has no impact on \(\epsilon\).  Consequently, the choice of flexible vs. inflexible is largely irrelevant here.  If Var(\(\epsilon\)) is extremely high, it does not matter which model type we choose: any model will have very little power to explain the variability in our observed \(Y\), as most of this variability is due to \(\epsilon\) which, by definition, is unrelated to \(f(X)\).
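
A small simulation sketch (hypothetical data and parameter values) illustrates the point: when Var(\(\epsilon\)) dominates, a flexible and an inflexible fit leave similarly large unexplained variability.

# Sketch: Y = f(X) + epsilon with an extremely high error variance (sd = 50)
set.seed(2)
n = 1000
x = runif(n, 0, 10)
y = 2 + 3 * x + rnorm(n, sd = 50)           # Var(epsilon) = 2500 dominates f(X)

inflexible = lm(y ~ x)                      # simple linear fit
flexible   = lm(y ~ poly(x, 10))            # 10th-degree polynomial fit

# Both fits leave residual variance close to Var(epsilon) = 2500,
# so neither explains much of the variability in y
c(var(residuals(inflexible)), var(residuals(flexible)), var(y))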


2. Descriptive Analysis
oral = c(4, 1, 4, 5, 3, 2, 3, 4, 3, 5, 2, 2, 4, 3, 5, 5, 1, 1, 1, 2)
written = c(2, 3, 1, 4, 2, 5, 3, 1, 2, 1, 2, 2, 1, 1, 2, 3, 1, 2, 3, 4)


(a) Use R to calculate the mean, the mode, the median, the variance and the standard deviation of the oral and written exams separately and together as well.

R does not have an in-built function to find the Mode of a vector so we must create our own:

# Return the most frequently occurring value in vector v
get_mode = function(v) {
  uniqv = unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

Calculating our statistics separately:

> mean(oral)
[1] 3
> get_mode(oral)
[1] 4
> median(oral)
[1] 3
> var(oral)
[1] 2.105263
> sd(oral)
[1] 1.450953
> 
> mean(written)
[1] 2.25
> get_mode(written)
[1] 2
> median(written)
[1] 2
> var(written)
[1] 1.355263
> sd(written)
[1] 1.164158

Calculating them together:

> exam.results = data.frame(oral, written)
> sapply(exam.results, mean)
   oral written 
   3.00    2.25 
> sapply(exam.results, get_mode)
   oral written 
      4       2 
> sapply(exam.results, median)
   oral written 
      3       2 
> sapply(exam.results, var)
    oral  written 
2.105263 1.355263 
> sapply(exam.results, sd)
    oral  written 
1.450953 1.164158 


(b) Find the covariance and correlation between the oral and written exam scores.

> cov(oral, written)
[1] -0.3157895
> cor(oral, written)
[1] -0.1869531


(c) Is there a positive or negative or no correlation between the two?

The (Pearson) correlation coefficient of -0.1869531 indicates a slight negative linear correlation between Oral and Written exam results.


(d) Is there causation between the two? Justify your answers.

The presence of correlation never implies causation.  Causation is a proposition developed from reasoned argument.  Correlation can be used to support (or refute) such a proposition but it can never indicate causation.


3. Descriptive Analysis
> library("ISLR")
> head(Auto)
  mpg cylinders displacement horsepower weight acceleration year origin
1  18         8          307        130   3504         12.0   70      1
2  15         8          350        165   3693         11.5   70      1
3  18         8          318        150   3436         11.0   70      1
4  16         8          304        150   3433         12.0   70      1
5  17         8          302        140   3449         10.5   70      1
6  15         8          429        198   4341         10.0   70      1
                       name
1 chevrolet chevelle malibu
2         buick skylark 320
3        plymouth satellite
4             amc rebel sst
5               ford torino
6          ford galaxie 500


(a) Which of the predictors are quantitative, and which are qualitative?

> quantVars= c("mpg", "displacement", "horsepower", "weight", "acceleration")
> qualVars = c("cylinders", "year", "origin", "name")
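
If these qualitative variables were to be used in a model, one possible refinement (a sketch only, not required by the question) is to encode them as factors so that R treats them as categories rather than numbers:

autoFactors = Auto                                   # work on a copy of the data
autoFactors$cylinders = factor(autoFactors$cylinders)
autoFactors$origin    = factor(autoFactors$origin)
str(autoFactors[qualVars])                           # check the resulting types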


(b) What is the range of each quantitative predictor? You can answer this using the range() function.

> rng = sapply(Auto[quantVars], range)
> rownames(rng) = c("Min", "Max")
> rng
     mpg displacement horsepower weight acceleration
Min  9.0           68         46   1613          8.0
Max 46.6          455        230   5140         24.8


(c) What is the mean and standard deviation of each quantitative predictor?

> Mean = sapply(Auto[quantVars], mean)
> Stdev= sapply(Auto[quantVars], sd)
> rbind(Mean, Stdev)
            mpg displacement horsepower    weight acceleration
Mean  23.445918      194.412  104.46939 2977.5842    15.541327
Stdev  7.805007      104.644   38.49116  849.4026     2.758864


(d) Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains?

> e = 10:85
> subSet = Auto[-e,]
> myData = sapply(subSet[quantVars], range)
> rownames(myData) = c("Min", "Max")
> Mean = sapply(subSet[quantVars], mean)
> Stdev= sapply(subSet[quantVars], sd)
> myData = rbind(myData, Mean)
> myData = rbind(myData, Stdev)
> myData
            mpg displacement horsepower    weight acceleration
Min   11.000000     68.00000   46.00000 1649.0000     8.500000
Max   46.600000    455.00000  230.00000 4997.0000    24.800000
Mean  24.404430    187.24051  100.72152 2935.9715    15.726899
Stdev  7.867283     99.67837   35.70885  811.3002     2.693721


(e) Using the full data set, investigate the predictors graphically, using scatterplots or other tools of your choice.  Create some plots highlighting the relationships among the predictors.  Comment on your findings.

> pairs(Auto[quantVars], main="Scatterplots of Pairs of Predictor Variables")

The above chart depicts scatterplots for all combinations of the quantitative variables.  Viewing the charts in “columns”, we can see the following:

The “mpg” variable displays a non-linear relationship with displacement, horsepower and weight.  However, there appears to be no obvious relationship with acceleration: the data are more scattered than for the other variables and no clear pattern (linear or otherwise) is apparent.

“displacement” appears linearly related to horsepower, weight and acceleration.  Its relationship with mpg has already been noted in 1. above.

There appear to be linear relationships between “horsepower” and weight, and between horsepower and acceleration, although the latter may be slightly non-linear.  Horsepower’s relationship to the other quantitative variables has already been described in 1. and 2. above.

“weight” appears to be related to acceleration (weight’s relationship to the other variables is described above).  However, there is a lot of dispersion between weight and acceleration, suggesting the relationship is not strong.
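
As a numerical complement to the scatterplots, the pairwise correlation matrix (a short sketch reusing the quantVars vector defined above) quantifies the strength of the linear relationships described above:

round(cor(Auto[quantVars]), 2)          # pairwise Pearson correlations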


(f) Suppose that we wish to predict gas mileage (\(mpg\)) on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting \(mpg\)? Justify your answer.

As explained in the previous question, “mpg” appears to have a non-linear relationship with each of displacement, horsepower and weight.  It appears there is little, if any, relationship between mpg and acceleration.


4. Linear Regression


(a) Use the lm() function to perform a simple linear regression with \(mpg\) as the response and \(horsepower\) as the predictor.

> horsepower = Auto$horsepower
> mpg = Auto$mpg
> ols = lm(mpg ~ horsepower)

Use the summary() function to print the results.

> summary(ols)

Call:
lm(formula = mpg ~ horsepower)

Residuals:
     Min       1Q   Median       3Q      Max 
-13.5710  -3.2592  -0.3435   2.7630  16.9240 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 39.935861   0.717499   55.66   <2e-16 ***
horsepower  -0.157845   0.006446  -24.49   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.906 on 390 degrees of freedom
Multiple R-squared:  0.6059,    Adjusted R-squared:  0.6049 
F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16

Comment on the output.  For example:

i. Is there a relationship between the predictor and the response?

Based on our sample of observations, we estimate the coefficient of the predictor to be -0.157845 with a standard error of 0.006446.  Consequently, our estimate lies 24.49 standard errors below zero.  This ratio is the t-value (shown in the above summary).

If the true value of \(\beta_1\) were zero, then the chance of observing an estimate of -0.157845 (some 24.49 standard errors away from this hypothesised true value) would be extremely low.  This is what the extremely small p-value of 7.031989e-81 confirms.

Hence, it is highly unlikely that we obtained our -0.157845 estimate by pure chance.  It is more likely that the true value of \(\beta_1\) is not zero, ie. there is a relationship between the predictor and the response.
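
As a sketch of where this p-value comes from, it can be reproduced from the t-statistic and the residual degrees of freedom reported in the summary:

# Two-sided p-value for the horsepower coefficient (t = -24.49 on 390 df)
t_stat = coef(summary(ols))["horsepower", "t value"]
2 * pt(-abs(t_stat), df = ols$df.residual)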


ii. How strong is the relationship between the predictor and the response?

The R-squared statistic tells us that 60.59% of the variability in \(mpg\) can be explained by \(horsepower\).  While 60.59% is a substantial proportion, it must be remembered that over a third (39.41%) of the variability in \(mpg\) still remains unexplained.
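
The R-squared figure can be verified directly from its definition as the proportion of the total sum of squares explained by the regression (a short sketch):

rss = sum(residuals(ols)^2)             # residual sum of squares
tss = sum((mpg - mean(mpg))^2)          # total sum of squares
1 - rss / tss                           # proportion of variability explained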


iii. Is the relationship between the predictor and the response positive or negative?

As shown in the above summary table, there is a negative relationship between \(mpg\) and \(horsepower\) - the more horsepower, the lower the mpg, and vice versa.


iv. What is the predicted \(mpg\) associated with a \(horsepower\) of 98?  What are the associated 95% confidence and prediction intervals?

At a \(horsepower\) of 98, the predicted \(mpg\) is 24.46708.  This is our point estimate of the average \(mpg\) of a car with this level of \(horsepower\).

The 95% confidence interval for this average is [23.97308, 24.96108].

The 95% prediction interval for any single car is [14.8094, 34.12476].

> predict(ols, data.frame(horsepower=98), level=0.95, interval="confidence")
       fit      lwr      upr
1 24.46708 23.97308 24.96108
> predict(ols, data.frame(horsepower=98), level=0.95, interval="prediction")
       fit     lwr      upr
1 24.46708 14.8094 34.12476


(b) Plot the response and the predictor. Use the abline() function to display the least squares regression line.

plot(mpg ~ horsepower, 
  pch=19,
  cex=0.5, 
  main="mpg against horsepower"
)

newx = seq(min(horsepower), max(horsepower), len=100)

preds = predict(ols, newdata=data.frame(horsepower=newx), interval="prediction")
transparent_green = rgb(0, 255, 0, alpha=20, maxColorValue = 255)
polygon(
  c(rev(newx), newx), 
  c(rev(preds[ ,"upr"]), preds[ ,"lwr"]), 
  col = transparent_green, 
  border = NA
)
lines(newx, preds[,"lwr"], col="darkgreen", lwd=2)
lines(newx, preds[,"upr"], col="darkgreen", lwd=2)

confs = predict(ols, newdata=data.frame(horsepower=newx), interval="confidence")
transparent_red = rgb(255, 0, 0, alpha=80, maxColorValue = 255)

polygon(
  c(rev(newx), newx), 
  c(rev(confs[ ,"upr"]), confs[ ,"lwr"]), 
  col = transparent_red, 
  border = NA
)
lines(newx, confs[,"lwr"], col="red", lwd=2)
lines(newx, confs[,"upr"], col="red", lwd=2)

abline(ols, lty="dashed")

legend(150, 40,
  c("Regression", "Confidence Interval", "Prediction Interval"), 
  lty=c("dashed", "solid", "solid"),
  lwd=c(2, 2, 2),
  col=c("black", "red", "darkgreen"),
  bty="n"
)


5. Logistic Regression


Using the Boston data set, fit classification models in order to predict whether a given suburb has a crime rate above or below the median.  Explore logistic regression models using various subsets of the predictors.  Describe your findings.

> library("MASS")
> df = data.frame("hc" = ifelse(Boston$crim > median(Boston$crim), 1, 0), Boston[, -1])

As we will be repeating the logistic regression process for a variety of different predictor variables, it will be useful to put our logic into a single function and then call that function when needed, passing in the predictor variables desired.  The function reports the coefficient estimates for these variables as well as an indication of how well the model fits the training data (“fit accuracy”) and how accurately it predicts (“prediction accuracy”), the latter estimated using 5-fold cross-validation.

library("boot")
l = function(x) {
  # Fit a logistic regression of hc on the predictor variables named in x
  fit = glm(as.formula(paste("hc ~", paste(x, collapse = " + "))), binomial, df)
  fit.coefs = round(summary(fit)$coefficients, 5)
  # In-sample classifications using a 0.5 probability threshold
  fit.preds = ifelse(predict(fit, type="response") > 0.5, 1, 0)
  fit.accuracy = mean(fit.preds == df$hc)
  # delta[1] is the 5-fold cross-validated prediction error from cv.glm()
  predict.accuracy = 1 - cv.glm(df, fit, K=5)$delta[1]
  cat("\n")
  print(fit.coefs)
  cat(paste("Accuracy (fit):", round(fit.accuracy * 100, 1), "%\n"))
  cat(paste("Accuracy (predict):", round(predict.accuracy * 100, 1), "%\n"))
  return(predict.accuracy)
}

Using the above function, we can examine the case where all predictor variables are used in the regression:

> r = l(c("zn", "indus", "chas", "nox", "rm", "age", "dis", "rad", "tax", "ptratio", "black", "lstat", "medv"))
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

             Estimate Std. Error  z value Pr(>|z|)
(Intercept) -34.10370    6.53001 -5.22261  0.00000
zn           -0.07992    0.03373 -2.36930  0.01782
indus        -0.05939    0.04372 -1.35834  0.17436
chas          0.78533    0.72893  1.07737  0.28132
nox          48.52378    7.39650  6.56037  0.00000
rm           -0.42560    0.70110 -0.60704  0.54383
age           0.02217    0.01222  1.81431  0.06963
dis           0.69140    0.21831  3.16708  0.00154
rad           0.65646    0.15245  4.30605  0.00002
tax          -0.00641    0.00269 -2.38468  0.01709
ptratio       0.36872    0.12214  3.01889  0.00254
black        -0.01352    0.00654 -2.06919  0.03853
lstat         0.04386    0.04898  0.89549  0.37052
medv          0.16713    0.06694  2.49671  0.01254
Accuracy (fit): 91.5 %
Accuracy (predict): 92.6 %

This shows the model correctly classifies 91.5% of cases in the training data and achieves a cross-validated prediction accuracy of 92.6%, indicating the model fits the data quite well.  However, there are a number of predictor variables whose estimated coefficients are not statistically significantly different from zero at the 0.05 level.  Removing these variables from the regression gives:

> r = l(c("zn", "nox", "dis", "rad", "tax", "ptratio", "black", "medv"))
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

             Estimate Std. Error  z value Pr(>|z|)
(Intercept) -28.34771    5.56986 -5.08948  0.00000
zn           -0.07450    0.02998 -2.48535  0.01294
nox          44.18044    6.28975  7.02420  0.00000
dis           0.48985    0.19493  2.51294  0.01197
rad           0.69212    0.13784  5.02107  0.00000
tax          -0.00745    0.00243 -3.06742  0.00216
ptratio       0.27215    0.10731  2.53605  0.01121
black        -0.01348    0.00633 -2.13005  0.03317
medv          0.08791    0.03079  2.85554  0.00430
Accuracy (fit): 89.3 %
Accuracy (predict): 92.7 %

The in-sample fit accuracy has declined slightly to 89.3%, while the cross-validated prediction accuracy is essentially unchanged at 92.7%, and all of the remaining predictor variables are statistically significant (at the 0.05 level).
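
To see what lies behind the fit-accuracy figure, a confusion matrix for the reduced model can be produced by refitting it outside the helper function (a sketch; the helper l only returns the prediction accuracy):

fit   = glm(hc ~ zn + nox + dis + rad + tax + ptratio + black + medv,
            family = binomial, data = df)
preds = ifelse(predict(fit, type = "response") > 0.5, 1, 0)
table(Predicted = preds, Actual = df$hc)   # counts of correct/incorrect classifications
mean(preds == df$hc)                       # in-sample fit accuracy (approx. 89.3%)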


6. Resampling Methods


Suppose that we use some statistical learning method to make a prediction for the response \(y\) for a particular value of the predictor \(x\).  Carefully describe how we might estimate the standard deviation of our prediction.

To estimate the standard deviation of our prediction, we can take the square root of the average squared deviation of the actual observation \((y_i)\) from the value predicted by the model \((\hat{y}_i)\).  Mathematically:

\(Prediction \mspace{5mu} standard \mspace{5mu} deviation = s = \sqrt{\displaystyle\frac{1}{n} .\displaystyle\sum_{i=1}^n (y_i - \hat{y}_i)^2}\)


where \(y_i\) is the observed \(i\)th value and \(\hat{y}_i\) is the model’s prediction for the \(i\)th value.

If we under-predict, then \(y_i - \hat{y}_i\) will be positive and, if we over-predict, it will be negative.  Using the square of these deviations eliminates the problem of offsetting over- and under-predictions cancelling each other out, which can hide the level of variability in our prediction model.  For example, if the deviations were -7, +4, -3, +5, +1 the sum would be zero.  Using the sum of squares, however, we get 100 and the average squared deviation is 20.
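
The arithmetic in this example can be checked directly in R:

devs = c(-7, 4, -3, 5, 1)
sum(devs)                               # 0   - offsetting deviations cancel out
sum(devs^2)                             # 100
mean(devs^2)                            # 20  - the average squared deviation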

This average squared deviation is called the Mean Squared Error (MSE) and, when calculating it for our predictions, we use the \(y_i\) observations from the test sample.  Of course, we calculate the parameters of our prediction model using the training data set, but calculating the MSE on that same training data only indicates how well the model ‘fits’ the training data - it says nothing about the model’s predictive power.

To get an indication of the model’s predictive power, we calculate the MSE on the test data and take the square root of the result to arrive at the standard deviation.  If the assumption is that the errors (not the residuals - see the discussion in Question 1d above) are normally distributed, then this standard deviation can be used to make statements about the model’s predictive power.  For example, in 19 out of 20 predictions we expect the true \(y_i\) value to fall within a range of 1.96 standard deviations either side of the model’s predicted value.  Obviously, a model with a lower standard deviation has a ‘tighter’ band and, consequently, better predictive power than one with a larger standard deviation.
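
As a concrete sketch of this procedure (reusing the Question 4 regression purely as an illustration, with a hypothetical 50/50 train/test split):

set.seed(1)                                        # hypothetical split
train = sample(nrow(Auto), nrow(Auto) / 2)
fit   = lm(mpg ~ horsepower, data = Auto, subset = train)
pred  = predict(fit, Auto[-train, ])
sqrt(mean((Auto$mpg[-train] - pred)^2))            # test-set prediction standard deviation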

While this MSE-based method works well in the regression situation, the classification scenario requires a different approach.  In a classification situation, \(y_i\) is no longer a continuous random variable.  Instead, it is discrete and binary, ie. it follows a Bernoulli process.  Consequently, the squared deviation is also binary - each observation has either been correctly predicted \((y_i = \hat{y}_i)\) or not \((y_i \neq \hat{y}_i)\).  There is no in-between.  In such a construction, the MSE has little value for comparing the predictive power of competing models.

For classification problems, statements about a model’s prediction accuracy are done using the Error Rate:

\(Error \mspace{5mu} Rate = \displaystyle\frac{1}{n} .\displaystyle\sum_{i=1}^n \mathcal H(y_i, \hat{y}_i)\)

where \(\mathcal H\) is the Heaviside (indicator) function:

\(\begin{equation} \mathcal H(y_i, \hat{y}_i) = \begin{cases} 1 & \text{if $\hat{y}_i \neq y_i$}\\ 0 & \text{otherwise} \end{cases} \end{equation}\)


Here we can see that if, in a test sample of 100 observations, our model fails to correctly classify 10 of the \(y_i\) then the Error Rate is 0.1, ie. 10%.  Different classification models produce different error rates, allowing comparison between them.  As is the case with regression, we calculate the Error Rate using the test data set to assess predictive power.
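
In R, the Error Rate reduces to the proportion of mismatches between the observed and predicted classes (a sketch with hypothetical label vectors):

set.seed(3)
y    = rbinom(100, 1, 0.5)              # 100 hypothetical test-set labels
yhat = y
flip = sample(100, 10)                  # suppose 10 observations are misclassified
yhat[flip] = 1 - yhat[flip]
mean(y != yhat)                         # Error Rate = 0.1, ie. 10%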


7. Resampling Methods


We will now perform cross-validation on a simulated data set.


(a) Generate a simulated data set as follows:

set.seed(500)
y = rnorm(500)
x = 4-rnorm(500)
y = x - 2*x^2 + 3*x^4 + rnorm(500)

As the above will be repeated for different seeds in later questions, the following function will be used to generate a data set for a given seed, \(s\):

getSeededData = function(s) {
  set.seed(s)
  y = rnorm(500)
  x = 4-rnorm(500)
  y = x - 2*x^2 + 3*x^4 + rnorm(500)
  df= data.frame("y"=y, "x1"=x, "x2"=x^2, "x3"=x^3, "x4"=x^4)
  return(df)
}


In this data set, what is \(n\) and what is \(p\)?

\(n\) is the number of observations, ie. \(n\) = 500.
\(p\) is the number of predictors, ie. \(p\) = 3 (\(x\), \(x^2\), \(x^4\)).


Write out the model used to generate the data in equation form.

\(y = x - 2x^2 + 3x^4 + \epsilon\)


(b) Create a scatterplot of X against Y.  Comment on what you find.

df = getSeededData(500)
plot(df$x1, df$y, xlab="X", ylab="Y", main="Y on X", pch=19, cex=0.5, col=rgb(86, 36, 53, maxColorValue = 255))

From the above plot, it is clear the relationship between \(X\) and \(Y\) is non-linear.  While there is evidence of variability in \(x\), the rnorm(500) term in \(y\) has little visible impact because the random variates are small (mean = 0, sd = 1) in comparison to the much larger deterministic component of \(y\).  Hence, the plot looks like a smooth curve.  We can see the randomness if we zoom in, as the next chart does by restricting the axes to x = (3, 3.1) and y = (200, 300):

plot(df$x1, df$y, xlab="X", ylab="Y", main="Y on X", pch=19, cex=0.5, col=rgb(86, 36, 53, maxColorValue = 255), xlim=c(3,3.1), ylim=c(200,300))

From this plot, we can see the points do not fit on a smooth line/curve.


(c) Set the seed to be 23, and then compute the LOOCV and 10-fold CV errors that result from fitting the following four models using least squares.  Note you may find it helpful to use the data.frame() function to create a single data set containing both \(X\) and \(Y\).

> df = getSeededData(23)
> library("boot")

As we will be calculating LOOCV and 10-fold CV repeatedly, we shall use the following function to encapsulate and simplify the code:

showMyCVs = function(fit) {
  loocv = (cv.glm(df, fit))$delta[1]
  kfold = (cv.glm(df, fit, K=10))$delta[1]
  cat(paste("LOOCV =", loocv, ", 10-fold CV =", kfold))
}


i.   \(Y = \beta_0 + \beta_1 X + \epsilon\)

> showMyCVs(glm(y ~ x1, data=df))
LOOCV = 195337.297150242 , 10-fold CV = 193314.650140994


ii.  \(Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \epsilon\)

> showMyCVs(glm(y ~ x1+x2, data=df))
LOOCV = 11865.4098083307 , 10-fold CV = 12018.4546211804


iii. \(Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \epsilon\)

> showMyCVs(glm(y ~ x1+x2+x3, data=df))
LOOCV = 151.481840745371 , 10-fold CV = 157.890934227387


iv.  \(Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \beta_4 X^4 + \epsilon\)

> showMyCVs(glm(y ~ x1+x2+x3+x4, data=df))
LOOCV = 1.04678407691056 , 10-fold CV = 1.05923599962837


(d) Repeat (c) using random seed 46, and report your results.  Are your results the same as what you got in (c)?  Why?

> df = getSeededData(46)


i.   \(Y = \beta_0 + \beta_1 X + \epsilon\)

> showMyCVs(glm(y ~ x1, data=df))
LOOCV = 212789.740894727 , 10-fold CV = 212191.155996556


ii.  \(Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \epsilon\)

> showMyCVs(glm(y ~ x1+x2, data=df))
LOOCV = 12646.199351586 , 10-fold CV = 13106.3094597804


iii. \(Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \epsilon\)

> showMyCVs(glm(y ~ x1+x2+x3, data=df))
LOOCV = 134.149872021215 , 10-fold CV = 137.344748362644


iv.  \(Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \beta_4 X^4 + \epsilon\)

> showMyCVs(glm(y ~ x1+x2+x3+x4, data=df))
LOOCV = 1.0468606134498 , 10-fold CV = 1.05772631695867

The results are similar to, but not the same as, those in (c).  Due to the very nature of the sampling process, different samples (from the same population) will give rise to different estimates from the same statistical method.  By generating data sets from different random seeds, our results demonstrate this.


(e) Which of the models in (c) had the smallest LOOCV and 10-fold CV error?  Is this what you expected?  Explain your answer.

Unsurprisingly, the most flexible model (iv) had the lowest cross-validation error under either approach (LOOCV or K-fold).  This was expected because the true data-generating function for the \(y\) values is itself a 4th-order polynomial.


(f) Comment on the statistical significance of the coefficient estimates that results from fitting each of the models in (c) using least squares.  Do these results agree with the conclusions drawn based on the cross-validation results?

summary(glm(y ~ x1, data=df))$coefficients
              Estimate Std. Error   t value      Pr(>|t|)
(Intercept) -2741.3934   82.89426 -33.07097 9.585235e-128
x1            949.2767   19.89434  47.71591 6.934050e-188

summary(glm(y ~ x1+x2, data=df))$coefficients
             Estimate Std. Error   t value      Pr(>|t|)
(Intercept)  2255.115  58.036609  38.85677 9.654660e-153
x1          -1658.011  28.884573 -57.40128 1.915860e-221
x2            318.612   3.482377  91.49267 3.775696e-313

summary(glm(y ~ x1+x2+x3, data=df))$coefficients
              Estimate Std. Error    t value      Pr(>|t|)
(Intercept) -602.94884 13.6068239  -44.31224 1.408780e-174
x1           709.02757 10.6442661   66.61122 1.457449e-249
x2          -291.54918  2.6695925 -109.21112  0.000000e+00
x3            49.43981  0.2145962  230.38533  0.000000e+00

summary(glm(y ~ x1+x2+x3+x4, data=df))$coefficients
              Estimate Std. Error     t value  Pr(>|t|)
(Intercept)  1.7803085 2.96624820   0.6001886 0.5486552
x1          -0.5948906 3.28250441  -0.1812307 0.8562607
x2          -1.4931308 1.29941026  -1.1490835 0.2510765
x3          -0.0614358 0.21828407  -0.2814488 0.7784838
x4           3.0021387 0.01317774 227.8190129 0.0000000

We can see all coefficients of models i., ii. and iii. are statistically significant; their p-values are at or near zero.  In model iii., even \(x^3\) appears to influence \(y\), although we know from the true data-generating function that it does not.  Furthermore, the estimated coefficients are quite different from the true values of 1, -2, 0 and 3 for \(x\), \(x^2\), \(x^3\) and \(x^4\) respectively.

For model iv., however, only the 4th predictor (\(x^4\)) is statistically significant.  All other predictors have high p-values such that it is not possible to reject the null hypothesis that the true value of the coefficient is zero. The estimate of the coefficient of \(x^4\) is 3.0021387 which is very close to the true value of 3.

These results agree with the conclusions drawn from the cross-validation: model iv., the only model whose coefficient estimates match the true data-generating function (with \(x^4\) significant and the remaining terms not), is also the model with by far the smallest cross-validation errors.