Big Data Analytics - Coursework 2

Student name:      Hammond Mason
Student number:  
Programme:          M.Sc.(Data Science)

Question 1 - Decision Trees

  1. Sketch the tree corresponding to the partition of the predictor space illustrated in the left-hand panel of the figure above.  The numbers inside the boxes indicate the mean of Y within each region.

  1. Create a diagram similar to the left-hand panel of the figure, using the tree illustrated in the right-hand panel of the same figure.  You should divide up the predictor space into the correct regions, and indicate the mean for each region.

Question 2 - Regression Trees

In the lab, a classification tree was applied to the Carseats data set after converting Sales into a qualitative response variable.  Now we will seek to predict Sales using regression trees and related approaches, treating the response as a quantitative variable.
library(tree)
library(ISLR)
Carseats = na.omit(Carseats)
# Helper: mean squared error (MSE) of a model's Sales predictions on a given data set
calculate_mse = function(model, dataset) {
  actual = dataset$Sales
  predicted = predict(model, newdata=dataset)
  mse = mean((actual-predicted)^2)
  return (mse)
}
  1. Split the data set into a training set and a test set.
n = nrow(Carseats)
set.seed(1)
idx = sample(n, n/2)
training = Carseats[idx,]
test = Carseats[-idx,]
  1. Fit a regression tree to the training set.  Plot the tree, and interpret the results.  What test MSE do you obtain?
tree = tree(Sales ~ ., data = training)
plot(tree, col="grey")
text(tree, pretty=0, col="red")

The above tree suggests the most significant influence on sales volume is product placement, viz. the relative quality of the shelving location at the store where the carseat was sold (ie. ShelveLoc).  Stores where ShelveLoc was described as ‘Good’ had higher average Sales than stores where ShelveLoc was ‘Medium’ or ‘Bad’.

Price had the next most important influence on Sales.  Then came - depending on the price level - either CompPrice or Age.  In total, the tree uses 6 of the 10 available predictors:

ShelveLoc, Price, Age, Advertising, Income, CompPrice
mse = calculate_mse(tree, test)
mse
[1] 4.148897
The Test data set has a mean squared error (MSE) of 4.148897, ie. a root MSE of approximately 2.037.  Consequently, we would expect most observations to have their Sales value within 2.037 either side of the value shown at the terminal node of the above tree into which the observation falls.

We can verify this by looking at the raw Test data:
predicted = predict(tree, newdata=test)
raw.data = data.frame(predicted, test)
library(knitr)
kable(head(raw.data))
predicted Sales CompPrice Income Advertising Population Price ShelveLoc Age Education Urban US
1 9.116667 9.50 138 73 11 276 120 Bad 42 17 Yes Yes
2 12.082000 11.22 111 48 16 260 83 Good 65 10 Yes Yes
3 7.512222 10.06 113 35 10 269 80 Medium 59 12 Yes Yes
5 4.627200 4.15 141 64 3 340 128 Bad 38 13 Yes No
7 7.130769 6.63 115 105 0 45 108 Medium 71 15 Yes No
8 11.014000 11.85 136 81 15 425 120 Good 67 10 Yes Yes
For the first row we have the following values for the predictor variables:

ShelveLoc = Bad
Price = 120
Age = 42
Advertising = 11

which, following the above tree, gives a value of 9.117 (the 4th terminal node from the left).  We see this in the ‘predicted’ column for this observation in the table above: 9.11667 (rounded to 3 decimal places = 9.117).  According to the above MSE, we expect the true value of Sales to fall between 9.11667-2.037 and 9.11667+2.037, ie. between 7.07967 and 11.15367.  As we see from the above table, it does: 7.07967 < 9.5 < 11.15367.

However, Sales for the 3rd observation (10.06) falls outside of the predicted range [5.47522, 9.54922].  This can happen, of course, as we expect most (not all) to fall within the range.  Indeed, if Sales is Normally distributed we would expect to see approx. 68% of test data set observations to fall within +/- 2.037 of the predicted value and 95% of test data set observations to fall within +/- 4.074 (2 x 2.037).  So let’s have a look:
sd = sqrt(mse)
sd1 = ifelse((raw.data$Sales > raw.data$predicted-sd) & (raw.data$Sales < raw.data$predicted+sd), 1, 0)
sum(sd1) / nrow(raw.data)
[1] 0.68
sd2 = ifelse((raw.data$Sales > raw.data$predicted-sd*2) & (raw.data$Sales < raw.data$predicted+sd*2), 1, 0)
sum(sd2) / nrow(raw.data)
[1] 0.945
These proportions are pretty well spot on for a normally distributed Sales variable.
  1. Use cross-validation in order to determine the optimal level of tree complexity.  Does pruning the tree improve the test MSE?
cv = cv.tree(tree)
plot(cv$size, cv$dev/nrow(Carseats), type="b", pch=19, col="#562435",
  xlab="Tree Size (# Terminal Nodes)", ylab="MSE",
  main="Cross Validation: Effect of Tree Size on MSE", ylim=c(2.6,2.8))

Cross validation indicates that the minimum MSE occurs when the tree size is 9 or 10.
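The minimising size can also be read directly from the cv object created above (a quick sketch):
cv$size[which.min(cv$dev)]   # tree size with the lowest cross-validated deviance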
pruned.tree = prune.tree(tree, best=9)
pruned.mse = calculate_mse(pruned.tree, test)
pruned.mse
[1] 4.993124
However, the above calculation shows that pruning the tree does not improve the MSE of the Test data set.  Let’s look at the effect of pruning on the test MSE:
# Test MSE for pruned trees of every size from 2 to 18 terminal nodes
pruned.test.mse = rep(NA, 18)
for (i in 2:18) {
  pruned.tree = prune.tree(tree, best=i)
  pruned.test.mse[i] = calculate_mse(pruned.tree, test)
}
plot(pruned.test.mse, type="b", pch=19, col="#562435", 
     xlab="Pruned Tree Size (# Terminal Nodes)", ylab="Test MSE", 
     main="Effect of Pruning on Test MSE", xlim=c(3,18), ylim=c(4,6))

The above chart shows that tree pruning actually increases the MSE of the Test data set.  The conflicting signals from the two data sets (cross-validation was done using the Training data set) are characteristic of the (single) Decision Tree method, which can have high variance.  Better approaches - such as Bagging, Boosting and Random Forests - address this problem by not relying upon a single tree and, instead, averaging the predictions of many trees.
  1. Use the bagging approach in order to analyze this data.  What test MSE do you obtain?  Use the importance() function to determine which variables are most important.
library(randomForest)
set.seed(3)
bagged.tree = randomForest(Sales ~ ., data=training, importance=TRUE, mtry=10)  # mtry = all 10 predictors => bagging
bagged.mse = calculate_mse(bagged.tree, test)
bagged.mse
[1] 2.55751
Using bagging has reduced the test MSE substantially.  With a single tree (of 18 leaves) our test MSE was 4.148897.  Now, with bagging, it has fallen 38.4% to 2.55751.
The downside of bagging is that there is no single tree with which we can identify the key predictor variables.  There are 500 trees, and each of our bagged predicted values is the average of 500 individual predictions from those trees.  Instead, for bagged results, we can use the importance() function to identify the key predictor variables:
importance(bagged.tree, type=1)
               %IncMSE
CompPrice   14.3725974
Income       5.9039670
Advertising 14.2546456
Population   0.5521805
Price       61.0815445
ShelveLoc   46.5308006
Age         20.9071380
Education    1.3675478
Urban       -2.7169457
US           6.8176330
We can now see the importance of each predictor.  %IncMSE reports the increase in out-of-bag MSE when the values of that predictor are randomly permuted; the larger the value, the more the model relies on that predictor and hence the greater its explanatory power.  Clearly, the most important variables are (in order of importance, greatest to least):
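The ranked listing below can be produced with the same stack/sort idiom used later for the random forest (the exact call is a reconstruction, as it was not shown in the original):
stack(sort(importance(bagged.tree, type=1)[,1], decreasing=TRUE))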
       values         ind
1  61.0815445       Price
2  46.5308006   ShelveLoc
3  20.9071380         Age
4  14.3725974   CompPrice
5  14.2546456 Advertising
6   6.8176330          US
7   5.9039670      Income
8   1.3675478   Education
9   0.5521805  Population
10 -2.7169457       Urban
  1. Use random forests to analyze this data.  What test MSE do you obtain?  Use the importance() function to determine which variables are most important.  Describe the effect of m, the number of variables considered at each split, on the test MSE obtained.
We can examine the effect on MSE of the above strong predictors (Price, ShelveLoc, etc.) by letting the algorithm randomly select a subset of the predictors to consider at each split of each tree - a process known as a Random Forest.
# Carseats has 10 predictors, so mtry ranges from 1 (most random) to 10 (bagging)
random.forest.mse = rep(0, 10)
for (i in 1:10) {
  set.seed(3)
  random.forest = randomForest(Sales ~ ., data=training, importance=TRUE, mtry=i)
  random.forest.mse[i] = calculate_mse(random.forest, test)
}
plot(random.forest.mse,
  type="b",
  pch=19,
  col="#562435",
  xlab="Number of Predictors (mtry)",
  ylab="MSE",
  main="Random Forest Results",
  ylim=c(2.5, 3.0)
)


set.seed(3)
random.forest.9 = randomForest(Sales ~ ., data=training, mtry=9, importance=TRUE)
random.forest.9.mse = calculate_mse(random.forest.9, test)
random.forest.9.mse
[1] 2.55751
stack(sort(importance(random.forest.9, type=1)[,1], decreasing=TRUE))
       values         ind
1  56.8881626       Price
2  43.7826458   ShelveLoc
3  21.1565415         Age
4  14.8210193 Advertising
5  13.9965137   CompPrice
6   5.9415088          US
7   5.2824015      Income
8   1.3512694   Education
9   0.5676259  Population
10 -2.9272455       Urban
From the above we can see:
  1. If we allow the random forest to choose from 9 predictors at each split, instead of all 10, we get essentially the same test MSE;
  2. If we limit the number of predictors below 9, the test MSE increases; and
  3. The importance of each predictor is roughly the same - although the Random Forest reports lower importance values for the strongest predictors and swaps the order slightly for some of them, eg. CompPrice and Advertising.

Question 3 - Classification Trees

This problem involves the OJ data set which is part of the ISLR package.
  1. Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations
library(ISLR)
OJ = na.omit(OJ)
n = nrow(OJ)
set.seed(1)
idx = sample(n, 800)
training = OJ[idx,]
test = OJ[-idx,]
  1. Fit a tree to the training data, with Purchase as the response and the other variables as predictors.  Use the summary() function to produce summary statistics about the tree, and describe the results obtained.  What is the training error rate?  How many terminal nodes does the tree have?
library(tree)
tree = tree(Purchase ~ ., data=training)
summary(tree)

Classification tree:
tree(formula = Purchase ~ ., data = training)
Variables actually used in tree construction:
[1] "LoyalCH"       "PriceDiff"     "SpecialCH"     "ListPriceDiff"
Number of terminal nodes:  8 
Residual mean deviance:  0.7305 = 578.6 / 792 
Misclassification error rate: 0.165 = 132 / 800 
Of the 17 predictors, only 4 are used: LoyalCH, PriceDiff, SpecialCH and ListPriceDiff.  These four give a tree with 8 terminal nodes and a training error rate of 16.5% (132 of the 800 training observations misclassified).
  1. Type in the name of the tree object in order to get a detailed text output.  Pick one of the terminal nodes, and interpret the information displayed.
tree
node), split, n, deviance, yval, (yprob)
      * denotes terminal node

 1) root 800 1064.00 CH ( 0.61750 0.38250 )  
   2) LoyalCH < 0.508643 350  409.30 MM ( 0.27143 0.72857 )  
     4) LoyalCH < 0.264232 166  122.10 MM ( 0.12048 0.87952 )  
       8) LoyalCH < 0.0356415 57   10.07 MM ( 0.01754 0.98246 ) *
       9) LoyalCH > 0.0356415 109  100.90 MM ( 0.17431 0.82569 ) *
     5) LoyalCH > 0.264232 184  248.80 MM ( 0.40761 0.59239 )  
      10) PriceDiff < 0.195 83   91.66 MM ( 0.24096 0.75904 )  
        20) SpecialCH < 0.5 70   60.89 MM ( 0.15714 0.84286 ) *
        21) SpecialCH > 0.5 13   16.05 CH ( 0.69231 0.30769 ) *
      11) PriceDiff > 0.195 101  139.20 CH ( 0.54455 0.45545 ) *
   3) LoyalCH > 0.508643 450  318.10 CH ( 0.88667 0.11333 )  
     6) LoyalCH < 0.764572 172  188.90 CH ( 0.76163 0.23837 )  
      12) ListPriceDiff < 0.235 70   95.61 CH ( 0.57143 0.42857 ) *
      13) ListPriceDiff > 0.235 102   69.76 CH ( 0.89216 0.10784 ) *
     7) LoyalCH > 0.764572 278   86.14 CH ( 0.96403 0.03597 ) *
Looking at the terminal node marked ‘7)’, we can tell that 278 training observations fall into this region, being the region of customers whose loyalty to the Citrus Hill brand (LoyalCH) is above 0.764572 (note the ‘yval’ is ‘CH’).  The class proportions show that roughly 96% of these customers purchased Citrus Hill, so the tree predicts, with high confidence, that a customer with LoyalCH above 0.764572 will purchase Citrus Hill Orange Juice.
  1. Create a plot of the tree, and interpret the results.
plot(tree, col="grey")
text(tree, pretty=0, col="red")

Clearly, the extent of loyalty to the Citrus Hill brand (“LoyalCH”) has a major impact on purchasing decisions.  If this loyalty is rated above 0.508643, then a purchase of Citrus Hill is predicted.  A purchase of Minute Maid is only predicted when either (a) loyalty to Citrus Hill is rated below 0.264232, or (b) loyalty is between 0.264232 and 0.508643, the price difference between the two products is less than 0.195, and Citrus Hill is not on special (SpecialCH < 0.5).

The cases where the terminal nodes under a branch have the same predicted value (ie. MM and MM, or CH and CH) illustrate node purity.  The more a terminal node is dominated by a single class, the ‘purer’ it is.  In effect, purity is the difference between an outcome being highly likely and merely probable: a prediction that terminates at a highly pure node (such as node 7) is very likely to match the actual outcome, whereas a prediction that falls into a less pure node (such as node 11) is correct only on the balance of probabilities, with a far from negligible chance that the actual outcome will be the opposite of that predicted.
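As a rough numeric illustration (a sketch using the yprob values printed in the tree output above), the Gini index 2p(1-p) quantifies this contrast between node 7 and node 11:
gini = function(p) 2 * p * (1 - p)
gini(0.96403)  # node 7: very low impurity, prediction highly reliable
gini(0.54455)  # node 11: close to maximal impurity, prediction much less certain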
  1. Predict the response on the test data, and produce a confusion matrix comparing the test labels to the predicted test labels.  What is the test error rate?
actual = test$Purchase
predicted = ifelse(predict(tree, newdata=test)[,1] <= 0.5, "MM", "CH")
confusion.matrix = table(actual, predicted)
confusion.matrix
      predicted
actual  CH  MM
    CH 147  12
    MM  49  62
test.error.rate = (confusion.matrix[1,2] + confusion.matrix[2,1]) / nrow(test)
test.error.rate
[1] 0.2259259
The test error rate is 22.6%.
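Equivalently, the test error rate can be computed directly from the vectors above (a minimal check):
mean(predicted != actual)   # proportion of misclassified test observations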
  1. Apply the cv.tree() function to the training set in order to determine the optimal tree size.
cv = cv.tree(tree, FUN=prune.misclass)
cv
$size
[1] 8 5 2 1

$dev
[1] 156 156 156 306

$k
[1]       -Inf   0.000000   4.666667 160.000000

$method
[1] "misclass"

attr(,"class")
[1] "prune"         "tree.sequence"
  1. Produce a plot with tree size on the x-axis and cross-validated classification error rate on the y-axis.
plot(cv$size, (cv$dev / nrow(training) * 100),
  type="b",
  pch=19,
  col="#562435",
  xlab="Tree Size (# Terminal Nodes)",
  ylab="Error Rate (%)",
  main="Cross-Validation Classification Error Rates for Different Sized Trees"
)

  1. Which tree size corresponds to the lowest cross-validated classification error rate?
Tree sizes 8, 5 and 2 all attain the minimum cross-validated error (156 misclassifications), so the smallest of these - the tree with 2 terminal nodes - corresponds to the lowest cross-validated classification error rate.
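This can be confirmed programmatically from the cv object (a short sketch):
cv$size[cv$dev == min(cv$dev)]        # all tree sizes attaining the minimum CV error
min(cv$size[cv$dev == min(cv$dev)])   # the smallest such tree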
  1. Produce a pruned tree corresponding to the optimal tree size obtained using cross-validation.  If cross-validation does not lead to selection of a pruned tree, then create a pruned tree with five terminal nodes.
pruned.tree = prune.misclass(tree, best=2)
summary(pruned.tree)

Classification tree:
snip.tree(tree = tree, nodes = c(3L, 2L))
Variables actually used in tree construction:
[1] "LoyalCH"
Number of terminal nodes:  2 
Residual mean deviance:  0.9115 = 727.4 / 798 
Misclassification error rate: 0.1825 = 146 / 800 
  1. Compare the training error rates between the pruned and unpruned trees.  Which is higher?
The training error rate for the pruned tree (18.25%) is slightly higher than that of the unpruned tree (16.5%).
  1. Compare the test error rates between the pruned and unpruned trees.  Which is higher?
actual = test$Purchase
predicted = ifelse(predict(pruned.tree, newdata=test)[,1] <= 0.5, "MM", "CH")
confusion.matrix = table(actual, predicted)
confusion.matrix
      predicted
actual  CH  MM
    CH 119  40
    MM  30  81
pruned.test.error.rate = (confusion.matrix[1,2] + confusion.matrix[2,1]) / nrow(test)
pruned.test.error.rate
[1] 0.2592593
The test error rate for the pruned tree (25.9%) is higher than the unpruned tree (22.6%).

Question 4 - SVM

In this problem, you will use support vector approaches in order to predict whether a given car gets high or low gas mileage based on the Auto data set.
  1. Create a binary variable that takes on a 1 for cars with gas mileage above the median, and a 0 for cars with gas mileage below the median.
library(e1071)
library(ISLR)
Auto = na.omit(Auto)
mpghigh = ifelse(Auto$mpg > median(Auto$mpg), 1, 0)
Auto$mpghigh = as.factor(mpghigh)
  1. Fit a support vector classifier to the data with various values of cost, in order to predict whether a car gets high or low gas mileage.  Report the cross-validation errors associated with different values of this parameter.  Comment on your results.
set.seed(1)
tune.linear = tune(svm, mpghigh~., data=Auto, kernel="linear", ranges=list(cost=c(0.01, 0.1, 1, 10, 100)))
linear = summary(tune.linear)
linear

Parameter tuning of 'svm':

- sampling method: 10-fold cross validation 

- best parameters:
 cost
    1

- best performance: 0.01275641 

- Detailed performance results:
   cost      error dispersion
1 1e-02 0.07403846 0.05471525
2 1e-01 0.03826923 0.05148114
3 1e+00 0.01275641 0.01344780
4 1e+01 0.02038462 0.01074682
5 1e+02 0.03820513 0.01773427
From the above output we can see the support vector classifier with the lowest cross-validation error is the one with cost = 1, having an error rate of 1.276%.
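The model refit at this best cost is stored in the tune object and can be reused directly (a short sketch of the e1071 tune API):
best.linear = tune.linear$best.model   # svm refit on the full data set with the best cost (1)
summary(best.linear)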
  1. Now repeat (b), this time using SVMs with radial and polynomial basis kernels, with different values of gamma and degree and cost.  Comment on your results.
Polynomial SVM
set.seed(1)
tune.poly = tune(svm, mpghigh~., data=Auto, kernel="polynomial", ranges=list(cost=c(0.01, 0.1, 1, 10, 100), degree=c(2,3,4)))
poly = summary(tune.poly)
poly

Parameter tuning of 'svm':

- sampling method: 10-fold cross validation 

- best parameters:
 cost degree
  100      2

- best performance: 0.3013462 

- Detailed performance results:
    cost degree     error dispersion
1  1e-02      2 0.5611538 0.04344202
2  1e-01      2 0.5611538 0.04344202
3  1e+00      2 0.5611538 0.04344202
4  1e+01      2 0.5382051 0.05829238
5  1e+02      2 0.3013462 0.09040277
6  1e-02      3 0.5611538 0.04344202
7  1e-01      3 0.5611538 0.04344202
8  1e+00      3 0.5611538 0.04344202
9  1e+01      3 0.5611538 0.04344202
10 1e+02      3 0.3322436 0.11140578
11 1e-02      4 0.5611538 0.04344202
12 1e-01      4 0.5611538 0.04344202
13 1e+00      4 0.5611538 0.04344202
14 1e+01      4 0.5611538 0.04344202
15 1e+02      4 0.5611538 0.04344202

The above results show the lowest cross-validation error occurs at a cost of 100 with a polynomial of degree 2.  However, compared to the results in (b), the cross-validation error of this ‘best’ polynomial SVM is approx. 24x that of the linear SVM (30.135% vs. 1.276%).

Radial SVM
set.seed(1)
tune.radial = tune(svm, mpghigh~., data=Auto, kernel="radial", ranges=list(cost=c(0.01, 0.1, 1, 10, 100), gamma=c(0.01,0.1,1,10)))
radial = summary(tune.radial)
radial

Parameter tuning of 'svm':

- sampling method: 10-fold cross validation 

- best parameters:
 cost gamma
  100  0.01

- best performance: 0.01532051 

- Detailed performance results:
    cost gamma      error dispersion
1  1e-02  0.01 0.56115385 0.04344202
2  1e-01  0.01 0.09185897 0.03862507
3  1e+00  0.01 0.07147436 0.05103685
4  1e+01  0.01 0.02551282 0.03812986
5  1e+02  0.01 0.01532051 0.01788871
6  1e-02  0.10 0.19153846 0.07612945
7  1e-01  0.10 0.07916667 0.05201159
8  1e+00  0.10 0.05608974 0.05092939
9  1e+01  0.10 0.02551282 0.02076457
10 1e+02  0.10 0.02807692 0.01458261
11 1e-02  1.00 0.56115385 0.04344202
12 1e-01  1.00 0.56115385 0.04344202
13 1e+00  1.00 0.06634615 0.06187383
14 1e+01  1.00 0.06128205 0.06186124
15 1e+02  1.00 0.06128205 0.06186124
16 1e-02 10.00 0.56115385 0.04344202
17 1e-01 10.00 0.56115385 0.04344202
18 1e+00 10.00 0.51775641 0.04471079
19 1e+01 10.00 0.51012821 0.03817175
20 1e+02 10.00 0.51012821 0.03817175
For the radial SVM, the lowest cross-validation error comes at a cost of 100 with a gamma of 0.01.  This ‘best’ radial SVM has a cross-validation error only slightly higher than that of the linear SVM in (b) (1.532% vs. 1.276%).
The above comparison becomes somewhat easier if we plot the results:
plot(linear$performances[,c("cost","error")], type="l", ylim=c(0,0.6), main="CV results for different SVMs")
lines(poly$performances[poly$performances["degree"]==2, c("cost", "error")], col=2)
lines(poly$performances[poly$performances["degree"]==3, c("cost", "error")], col=3)
lines(poly$performances[poly$performances["degree"]==4, c("cost", "error")], col=4)
lines(radial$performances[radial$performances["gamma"]==0.01, c("cost", "error")], col=5)
lines(radial$performances[radial$performances["gamma"]==0.1 , c("cost", "error")], col=6)
lines(radial$performances[radial$performances["gamma"]==1   , c("cost", "error")], col=7)
lines(radial$performances[radial$performances["gamma"]==10  , c("cost", "error")], col=8)
legend(5, 0.5,
  c("Linear", "Polynomial (degree=2)", "Polynomial (degree=3)", "Polynomial (degree=4)", "Radial (gamma=0.01)", "Radial (gamma=0.1)", "Radial (gamma=1)", "Radial (gamma=10)"),
  lty=rep("solid", 8),
  col=1:8,
  bty="n"
)

Here we can see that the Linear classifier gives the best cross-validation performance: it offers the lowest error rate with the least cost.  We can also see that we get similar performance (to the Linear case) from low-gamma Radial classifiers, if we are willing to accept greater cost.  As the value of cost increases, these low-gamma classifiers outperform the Linear classifier.  Although it is difficult to see from the above chart, the gamma=0.01 Radial classifier starts to outperform the Linear classifier when cost is above 30 (approx.) and the gamma=0.1 Radial classifier does better than the Linear classifier when cost is above 40 (approx.).  Finally, we can see that the Polynomial classifiers are not very useful as they require very high values of cost to get their performance anywhere near the Linear classifier.
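A compact numeric comparison of the three kernels can also be pulled straight from the tune objects above (a quick sketch):
data.frame(kernel = c("linear", "polynomial", "radial"),
           best.cv.error = c(tune.linear$best.performance,
                             tune.poly$best.performance,
                             tune.radial$best.performance))   # lowest CV error achieved by each kernel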
  1. Make some plots to back up your assertions in (b) and (c).

    Hint: In the lab, we used the plot() function for svm objects only in cases with \(p\) = 2.  When \(p\) > 2 you can use the plot() function to create plots displaying pairs of variables at a time.  Essentially, instead of typing plot(svmfit, dat) where svmfit contains your fitted model and dat is a data frame containing your data, you can type plot(svmfit, dat, x1 ~ x4) in order to plot just the first and fourth variables.  However, you must replace x1 and x4 with the correct variable names.  To find out more type ?plot.svm.
The below charts illustrate the ‘best performance’ case for each of the Linear, Radial and Polynomial SVMs as determined in (b) and (c) above.  As there are more than two predictor variables (horsepower, acceleration, etc.), visualising each SVM in a single chart is not possible; instead, a series of 2-dimensional charts has been plotted.
The background of each chart is shaded to highlight the position of the decision boundary.  The decision boundary is clearly visible in all of the Linear charts, whereas the Polynomial charts fail to show this demarcation, confirming the earlier conclusion that the Polynomial SVM is not a good model for the Auto data.  Some of the Radial charts do show the decision boundary and it is non-linear, although its shape in the charts for displacement and weight suggests the non-linearity is approaching linearity; the Radial decision boundary is not visible in the chart for horsepower.  This combination of charts supports the conclusion made in (c) above, that the Radial model is better than the Polynomial, but still inferior to the Linear.
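The charts under the headings below can be reproduced with plot.svm() on each tuned model; calls of roughly the following form were used (the particular variable pairs shown here are illustrative assumptions, not necessarily those chosen for the original charts):
plot(tune.linear$best.model, Auto, weight ~ displacement)
plot(tune.linear$best.model, Auto, acceleration ~ horsepower)
plot(tune.poly$best.model,   Auto, weight ~ displacement)
plot(tune.radial$best.model, Auto, weight ~ displacement)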
Linear Charts

Polynomial Charts

Radial Charts

Question 5 - SVM

Here we explore the maximal margin classifier on a toy data set.
  1. We are given \(n\) = 7 observations in \(p\) = 2 dimensions.  For each observation, there is an associated class label.  Sketch the observations.
X1 = c(3,2,4,1,2,4,4)
X2 = c(4,2,4,4,1,3,1)
Y = c("Red", "Red", "Red", "Red", "Blue", "Blue", "Blue")
plot(X1, X2, col=Y, pch=19)

  1. Sketch the optimal separating hyperplane and provide the equation for this hyperplane of the following form.
\(\beta_0 + \beta_1 X_1 + \beta_2 X_2 = 0\)


df = data.frame(X1, X2, y=as.factor(Y))
library(e1071)
mmc.fit = svm(y~., data=df, scale=FALSE, kernel="linear", cost=10)
beta = round(t(mmc.fit$coefs) %*% as.matrix(df[mmc.fit$index,1:2]),3)  # (beta1, beta2) recovered from the support vectors
beta0 = round(mmc.fit$rho, 3)                                          # offset term (rho)
mmc.intercept = beta0 / beta[2]     # intercept of the line in (X1, X2) space
mmc.gradient = -beta[1] / beta[2]   # slope of the line in (X1, X2) space
plot(X1, X2, col=Y, pch=19)
abline(mmc.intercept, mmc.gradient, lty=5)
legend(1, 4,
  c("Maximal Margin Hyperplane"), 
  lty=c("dashed"),
  lwd=c(2),
  bty="n"
)

The hyperplane has the equation:


\(0.5 - X_1 + X_2 = 0\)


which, rearranging, gives us the equation in (\(X_1,X_2\)) space:


\(X_2 = X_1 -0.5\)


  1. Describe the classification rule for the maximal margin classifier.  It should be something along the lines of “Classify to Red if \(\beta_0 + \beta_1 X_1 + \beta_2 X_2 \gt 0\) and classify to Blue otherwise.”  Provide the values for \(\beta_0\), \(\beta_1\) and \(\beta_2\).
Classify as Red if \(0.5 - X_1 + X_2 > 0\) and Blue otherwise, ie. \(\beta_0 = 0.5\), \(\beta_1 = -1\) and \(\beta_2 = 1\).
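Applying this rule to the seven training observations reproduces every label (a quick check using the vectors defined in (a)):
rule = ifelse(0.5 - X1 + X2 > 0, "Red", "Blue")
all(rule == Y)   # TRUE - every point is classified correctly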
  1. On your sketch indicate the margin for the maximal margin hyperplane.
plot(X1, X2, col=Y, pch=19)
abline(mmc.intercept, mmc.gradient, col="lightgrey")
abline((beta0-1)/beta[2], mmc.gradient, lty=5)
abline((beta0+1)/beta[2], mmc.gradient, lty=5)
legend(1, 4,
  c("Maximal Margin Hyperplane", "Margin"), 
  lty=c("solid", "dashed"),
  lwd=c(1, 2),
  col=c("lightgrey", "black"),
  bty="n"
)

  1. Indicate the support vectors for the maximal margin classifier.
plot(X1, X2, col=Y, pch=19)
abline(mmc.intercept, mmc.gradient, col="lightgrey")
abline((beta0-1)/beta[2], mmc.gradient, lty=5, col="lightgrey")
abline((beta0+1)/beta[2], mmc.gradient, lty=5, col="lightgrey")
points(df[mmc.fit$index,], pch=1, cex=3, col=df[mmc.fit$index,3])
legend(1, 4,
  c("Maximal Margin Hyperplane", "Margin", "Support Vector"), 
  lty=c("solid", "dashed", NA),
  lwd=c(1, 1, 1),
  pch=c(NA,NA,1),
  col=c("lightgrey", "lightgrey", "black"),
  bty="n"
)

The support vectors are points 2, 3, 6:
df[mmc.fit$index,1:2]
  X1 X2
2  2  2
3  4  4
6  4  3
  1. Argue that a slight movement of the seventh observation would not affect the maximal margin hyperplane.
The 7th observation (4,1) is not a support vector and lies well away from both the hyperplane and its margin.  Consequently, any slight movement in its position will not affect the hyperplane (nor its margin).
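We can confirm this by nudging observation 7 slightly and refitting (a sketch; the size of the nudge is arbitrary, provided the point stays outside the margin):
df.moved = df
df.moved[7, "X1"] = df.moved[7, "X1"] + 0.2   # small move of the 7th point
df.moved[7, "X2"] = df.moved[7, "X2"] - 0.2
moved.fit = svm(y ~ ., data=df.moved, scale=FALSE, kernel="linear", cost=10)
round(t(moved.fit$coefs) %*% as.matrix(df.moved[moved.fit$index, 1:2]), 3)  # unchanged beta
round(moved.fit$rho, 3)                                                     # unchanged beta0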
  1. Sketch a hyperplane that is not the optimal separating hyperplane and provide the equation for this hyperplane.
plot(X1, X2, col=Y, pch=19)
abline(mmc.intercept, mmc.gradient, col="lightgrey")
abline((beta0-1)/beta[2], mmc.gradient, lty=5, col="lightgrey")
abline((beta0+1)/beta[2], mmc.gradient, lty=5, col="lightgrey")
points(df[mmc.fit$index,], pch=1, cex=3, col="lightgrey")
so.mmc.fit = svm(y~., data=df, scale=FALSE, kernel="linear", cost=1)
so.beta = round(t(so.mmc.fit$coefs) %*% as.matrix(df[so.mmc.fit$index,1:2]),3)
so.beta0 = round(so.mmc.fit$rho, 3)
so.mmc.intercept = round(so.beta0   / so.beta[2], 3)
so.mmc.gradient = round(-so.beta[1] / so.beta[2], 3)
abline(so.mmc.intercept, so.mmc.gradient)
abline((so.beta0-1)/so.beta[2], so.mmc.gradient, lty=5)
abline((so.beta0+1)/so.beta[2], so.mmc.gradient, lty=5)
points(df[so.mmc.fit$index,], pch=1, cex=3, col=df[so.mmc.fit$index,3])
legend(1, 4,
  c("Maximal Margin Hyperplane", "Margin", "Support Vector", "Sub-optimal Maximal Margin Hyperplane", "Sub-optimal Margin", "Sub-optimal Support Vector"), 
  lty=c("solid", "dashed", NA, "solid", "dashed", NA),
  lwd=c(1, 1, NA, 1, 1, NA),
  pch=c(NA, NA, 1, NA, NA, 1),
  col=c("lightgrey", "lightgrey", "lightgrey", "black", "black", "black"),
  bty="n"
)

The above hyperplane is sub-optimal because (a) it relies on more support vectors (4 vs. 3 previously), and (b) its margin is wide enough that some observations now fall inside it.

The sub-optimal hyperplane has the equation:


\(-0.846 - 0.615X_1 + 1.077X_2 = 0\)


which, rearranging, gives us the equation in (\(X_1,X_2\)) space:


\(X_2 = 0.571X_1 + 0.786\)

  1. Draw an additional observation on the plot so that the two classes are no longer separable by a hyperplane.
plot(X1, X2, col=Y, pch=19)
points(4, 2, pch=19, col="red")
points(4, 2, pch=1, cex=3, col="red")

Here we can see that the new observation (4,2) results in the inability to draw a hyperplane that separates the reds from the blues.

Question 6 - Hierarchical Clustering

Consider the USArrests data.  We will now perform hierarchical clustering on the states.
  1. Using hierarchical clustering with complete linkage and Euclidean distance, cluster the states.
library(ISLR)
set.seed(1)
hc.complete = hclust(dist(USArrests), method="complete")
plot(hc.complete)

  1. Cut the dendrogram at a height that results in three distinct clusters.  Which states belong to which clusters?
hc.cut = cutree(hc.complete, 3)

Cluster #1:

subset(names(hc.cut), hc.cut==1)
 [1] "Alabama"        "Alaska"         "Arizona"        "California"    
 [5] "Delaware"       "Florida"        "Illinois"       "Louisiana"     
 [9] "Maryland"       "Michigan"       "Mississippi"    "Nevada"        
[13] "New Mexico"     "New York"       "North Carolina" "South Carolina"

Cluster #2:

subset(names(hc.cut), hc.cut==2)
 [1] "Arkansas"      "Colorado"      "Georgia"       "Massachusetts"
 [5] "Missouri"      "New Jersey"    "Oklahoma"      "Oregon"       
 [9] "Rhode Island"  "Tennessee"     "Texas"         "Virginia"     
[13] "Washington"    "Wyoming"      

Cluster #3:

subset(names(hc.cut), hc.cut==3)
 [1] "Connecticut"   "Hawaii"        "Idaho"         "Indiana"      
 [5] "Iowa"          "Kansas"        "Kentucky"      "Maine"        
 [9] "Minnesota"     "Montana"       "Nebraska"      "New Hampshire"
[13] "North Dakota"  "Ohio"          "Pennsylvania"  "South Dakota" 
[17] "Utah"          "Vermont"       "West Virginia" "Wisconsin"    
  1. Hierarchically cluster the states using complete linkage and Euclidean distance, after scaling the variables to have standard deviation one.
hc.complete.scaled = hclust(dist(scale(USArrests)), method="complete")
plot(hc.complete.scaled)

hc.cut.scaled = cutree(hc.complete.scaled, 3)

Cluster #1:

subset(names(hc.cut.scaled), hc.cut.scaled==1)
[1] "Alabama"        "Alaska"         "Georgia"        "Louisiana"     
[5] "Mississippi"    "North Carolina" "South Carolina" "Tennessee"     

Cluster #2:

subset(names(hc.cut.scaled), hc.cut.scaled==2)
 [1] "Arizona"    "California" "Colorado"   "Florida"    "Illinois"  
 [6] "Maryland"   "Michigan"   "Nevada"     "New Mexico" "New York"  
[11] "Texas"     

Cluster #3:

subset(names(hc.cut.scaled), hc.cut.scaled==3)
 [1] "Arkansas"      "Connecticut"   "Delaware"      "Hawaii"       
 [5] "Idaho"         "Indiana"       "Iowa"          "Kansas"       
 [9] "Kentucky"      "Maine"         "Massachusetts" "Minnesota"    
[13] "Missouri"      "Montana"       "Nebraska"      "New Hampshire"
[17] "New Jersey"    "North Dakota"  "Ohio"          "Oklahoma"     
[21] "Oregon"        "Pennsylvania"  "Rhode Island"  "South Dakota" 
[25] "Utah"          "Vermont"       "Virginia"      "Washington"   
[29] "West Virginia" "Wisconsin"     "Wyoming"      
  1. What effect does scaling the variables have on the hierarchical clustering obtained?  In your opinion, should the variables be scaled before the inter-observation dissimilarities are computed?  Provide a justification for your answer.
We can see the effect that scaling has on the clustering by looking at the Confusion Matrix (scaled vs. non-scaled):
table(hc.cut.scaled, hc.cut)
             hc.cut
hc.cut.scaled  1  2  3
            1  6  2  0
            2  9  2  0
            3  1 10 20
Scaling the data before clustering has resulted in the reclassification of some of the US States into a different cluster than that which was obtained using the raw, unscaled data.  For example, California was previously in Cluster #1 but is now in Cluster #2 and Georgia was previously in Cluster #2 but is now in Cluster #1.
Scaling data removes the effect that differences in measurement units can have on the dissimilarity measure used.  By transforming each variable from its raw value to the number of standard deviations that value lies from the mean of that variable, the scale() function puts every variable on the same scale, so that the distance measure (eg. Euclidean distance) is not unduly influenced by variables that happen to have a different scale.  In other words, after scaling, each variable is given equal importance in the clustering algorithm.  For example, if the distance on one dimension between two data points is measured in metres and that distance is 100m, while all of the other variables are measured in kilometres and their distances range between 0.1km and 1.0km, then the variable measured in metres will have undue prominence in the Euclidean distance calculation and the clustering algorithm will tend towards forming clusters based on that ‘metre’ predictor.  This may, or may not, be desirable.
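A toy numeric illustration of this point (a sketch; the variable names metres, km1 and km2 are hypothetical):
a = c(metres = 100, km1 = 0.2, km2 = 0.9)   # one observation, mixed units
b = c(metres =   0, km1 = 0.8, km2 = 0.1)   # another observation
x.raw = rbind(a, b)
dist(x.raw)          # ~100: dominated entirely by the 'metres' variable
dist(scale(x.raw))   # after scaling, every variable contributes equally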
The question of whether or not scaling is desirable, therefore, depends on the underlying data and the nature of the question being asked of that data.  For example, in the case of retail shopping data where the data set consists of a large number of low-value items and a small number of high-value items, without scaling, the high-volume/low-value transactions may take greater importance in clustering shoppers.  This unscaled data would be aptly suited to clustering individuals into High, Medium and Low item-count shoppers; however, it would be less suitable if the aim was to cluster individuals by the value of purchases made.  In the latter case, scaling the data would remove the high-volume/low-value bias in the raw data.
In the case of the USArrests data set, the variables are measured in different units: Murder, Rape and Assault are measured as the number of occurrences per 100,000 people, while UrbanPop is measured as the percentage of a State’s population that lives in Urban (as opposed to Rural) areas.  The individual variances of these raw variables are:
apply(USArrests, 2, var)
    Murder    Assault   UrbanPop       Rape 
  18.97047 6945.16571  209.51878   87.72916 
As Assault has a very large variance compared to the others, our clustering algorithm (which uses Euclidean distance and Complete linkage) will give preference to this variable, and our clusters will be biased towards grouping States with High, Medium and Low Assault crime rates.  If we scale the variables, however:
apply(scale(USArrests), 2, var)
  Murder  Assault UrbanPop     Rape 
       1        1        1        1 
then each variable has equal importance.  Thus, if we are interested in separating States into clusters of High, Medium and Low overall crime rates, then scaling the data is desirable.

Question 7 - PCA and K-Means Clustering

In this problem, you will generate simulated data and then perform PCA and K-Means Clustering on the data.
  1. Generate a simulated data set with 20 observations in each of three classes (ie. 60 observations total) and 50 variables.

    Hint: There are a number of functions in R that you can use to generate data.  One example is the rnorm() function; runif() is another option.  Be sure to add a mean shift to the observations in each class so that there are three distinct classes.
set.seed(1)
x1 = matrix(rnorm(20 * 50, mean=0), ncol=50)
set.seed(2)
x2 = matrix(rnorm(20 * 50, mean=1), ncol=50)
set.seed(3)
x3 = matrix(rnorm(20 * 50, mean=2), ncol=50)
x = rbind(x1, x2, x3)
sample.id = c(rep(1, 20), rep(2, 20), rep(3, 20))
plot(x, col=sample.id, pch=20)

The above plot of the first two (out of 50) variables shows the three samples co-mingling to a large extent: colour-coding is needed to clearly identify them, even though each sample is centred around a different mean.
  1. Perform PCA on the 60 observations and plot the first two principal components’ eigenvector.  Use a different colour to indicate the observations in each of the three classes.  If the three classes appear separated in this plot, then continue on to part (c).  If not, then return to part (a) and modify the simulation so that there is greater separation between the three classes.  Do not continue to part (c) until the three classes show at least some separation in the first two principal component eigenvectors.
plot(prcomp(x)$x[,1:2], pch=20, col=sample.id)

Using the first two principal components (as opposed to the first two raw variables plotted in (a) above) provides a much clearer separation of the three samples.
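We can also check how much of the total variance the first two components capture (a quick sketch):
pca = prcomp(x)
summary(pca)$importance[, 1:2]   # sd, proportion and cumulative proportion of variance for PC1 and PC2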
  1. Perform K-Means Clustering of the observations with K = 3.  How well do the clusters that you obtained in K-Means Clustering compare to the true class labels?

    Hint:You can use the table() function in R to compare the true class labels to the class labels obtained by clustering.  Be careful how you interpret the results: K-Means Clustering will arbitrarily number the clusters, so you cannot simply check whether the true class labels and clustering labels are the same.
cluster = kmeans(x, 3, nstart=20)$cluster
table(cluster, sample.id)
       sample.id
cluster  1  2  3
      1  0  0 20
      2  0 20  0
      3 20  0  0
The above table shows a perfect clustering of all three samples.  Each sample falls into its own, separate cluster.
  1. Perform K-Means Clustering with K = 2.  Describe your results.
cluster = kmeans(x, 2, nstart=20)$cluster
table(cluster, sample.id)
       sample.id
cluster  1  2  3
      1 20  0  0
      2  0 20 20
With K=2, both Sample IDs 2 and 3 are clustered together.
  1. Now perform K-Means Clustering with K = 4 and describe your results.
cluster = kmeans(x, 4, nstart=20)$cluster
table(cluster, sample.id)
       sample.id
cluster  1  2  3
      1  9  0  0
      2  0 20  0
      3 11  0  0
      4  0  0 20
With K=4, Sample ID 1 has been split into two separate clusters.
  1. Now perform K-Means Clustering with K = 3 on the first two principal components rather than on the raw data.  That is, perform K-Means Clustering on the 60 x 2 matrix of which the first column is the first principal component’s corresponding eigenvector and the second column is the second principal component’s corresponding eigenvector.  Comment on the results.
cluster = kmeans(prcomp(x)$x[,1:2], 3, nstart=20)$cluster
table(cluster, sample.id)
       sample.id
cluster  1  2  3
      1 20  0  0
      2  0 20  0
      3  0  0 20
Using the Principal Components, we get the same result as in (c) above, ie. perfect clustering.
  1. Using the scale() function perform K-Means Clustering with K = 3 on the data after scaling each variable to have standard deviation of one.  How do these results compare to those obtained in (b)?  Explain.
cluster = kmeans(scale(x), 3, nstart=20)$cluster
table(cluster, sample.id)
       sample.id
cluster  1  2  3
      1  0 20  0
      2  0  0 20
      3 20  0  0
Scaling has had no effect on the K=3 case: all samples are clustered correctly, which is the expected result.  The scale() function subtracts the mean of each column from every value in that column in order to re-centre the data, and then divides each resulting value by the standard deviation of that column, so that each column ends up with a standard deviation of 1.  The data set constructed in (a) has, within each column, three subsets of observations with the same standard deviation but different means.  Subtracting the overall column mean does not change the differences between these subset means, and dividing by the column standard deviation only rescales them.  Consequently, there are still sufficient differences between the subsets for the K-Means algorithm to place the same observations into the same clusters, whether or not the data has been scaled.
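A small check illustrates this (a sketch using the first column of x): scaling divides the gaps between the three class means by the column's standard deviation but preserves their sign and ordering, so the class structure survives.
tapply(x[, 1], sample.id, mean)          # class means in column 1 of the raw data
tapply(scale(x)[, 1], sample.id, mean)   # class means after centring and scaling - still well separated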