Big Data Analytics - Coursework 2

Student name:      Hammond Mason
Student number:  
Programme:          M.Sc.(Data Science)

Question 1 - Decision Trees

  1. Sketch the tree corresponding to the partition of the predictor space illustrated in the left-hand panel of the figure above.  The numbers inside the boxes indicate the mean of Y within each region.

  1. Create a diagram similar to the left-hand panel of the figure, using the tree illustrated in the right-hand panel of the same figure.  You should divide up the predictor space into the correct regions, and indicate the mean for each region.

Question 2 - Regression Trees

In the lab, a classification tree was applied to the Carseats data set after converting Sales into a qualitative response variable.  Now we will seek to predict Sales using regression trees and related approaches, treating the response as a quantitative variable.
library(tree)
library(ISLR)
Carseats = na.omit(Carseats)
# Helper: mean squared error (MSE) of a model's Sales predictions on a given data set
calculate_mse = function(model, dataset) {
  actual = dataset$Sales
  predicted = predict(model, newdata=dataset)
  mse = mean((actual-predicted)^2)
  return (mse)
}
  1. Split the data set into a training set and a test set.
n = nrow(Carseats)
set.seed(1)
idx = sample(n, n/2)
training = Carseats[idx,]
test = Carseats[-idx,]
  1. Fit a regression tree to the training set.  Plot the tree, and interpret the results.  What test MSE do you obtain?
tree = tree(Sales ~ ., data = training)
plot(tree, col="grey")
text(tree, pretty=0, col="red")

The above tree suggests the most significant influence on sales volume is product placement, viz. the relative quality of the shelving location at the store where the carseat was sold (ie. ShelveLoc).  Stores where ShelveLoc was described as ‘Good’ had higher average Sales than stores where ShelveLoc was ‘Medium’ or ‘Bad’.

Price had the next most important influence on Sales.  Then came - depending on the price level - either CompPrice or Age.  In total, the tree uses 6 of the 10 available predictors:

ShelveLoc, Price, Age, Advertising, Income, CompPrice
mse = calculate_mse(tree, test)
mse
[1] 4.148897
The Test data set has a mean squared error (MSE) of 4.148897, ie. a root MSE of approximately 2.037.  Consequently, we would expect most observations to have their Sales value within 2.037 either side of the value shown at the terminal node of the above tree into which the observation falls.

We can verify this by looking at the raw Test data:
predicted = predict(tree, newdata=test)
raw.data = data.frame(predicted, test)
library(knitr)
kable(head(raw.data))
predicted Sales CompPrice Income Advertising Population Price ShelveLoc Age Education Urban US
1 9.116667 9.50 138 73 11 276 120 Bad 42 17 Yes Yes
2 12.082000 11.22 111 48 16 260 83 Good 65 10 Yes Yes
3 7.512222 10.06 113 35 10 269 80 Medium 59 12 Yes Yes
5 4.627200 4.15 141 64 3 340 128 Bad 38 13 Yes No
7 7.130769 6.63 115 105 0 45 108 Medium 71 15 Yes No
8 11.014000 11.85 136 81 15 425 120 Good 67 10 Yes Yes
For the first row we have the following values for the predictor variables:

ShelveLoc = Bad
Price = 120
Age = 42
Advertising = 11

which, following the above tree, gives a value of 9.117 (the 4th terminal node from the left).  We see this in the ‘predicted’ column for this observation in the table above: 9.11667 (rounded to 3 decimal places = 9.117).  According to the above MSE, we expect the true value of Sales to fall between 9.11667-2.037 and 9.11667+2.037, ie. between 7.07967 and 11.15367.  As we see from the above table, it does: 7.07967 < 9.5 < 11.15367.

However, Sales for the 3rd observation (10.06) falls outside of the predicted range [5.47522, 9.54922].  This can happen, of course, as we expect most (not all) to fall within the range.  Indeed, if Sales is Normally distributed we would expect to see approx. 68% of test data set observations to fall within +/- 2.037 of the predicted value and 95% of test data set observations to fall within +/- 4.074 (2 x 2.037).  So let’s have a look:
sd = sqrt(mse)
sd1 = ifelse((raw.data$Sales > raw.data$predicted-sd) & (raw.data$Sales < raw.data$predicted+sd), 1, 0)
sum(sd1) / nrow(raw.data)
[1] 0.68
sd2 = ifelse((raw.data$Sales > raw.data$predicted-sd*2) & (raw.data$Sales < raw.data$predicted+sd*2), 1, 0)
sum(sd2) / nrow(raw.data)
[1] 0.945
These proportions are pretty well spot on for a normally distributed Sales variable.
  1. Use cross-validation in order to determine the optimal level of tree complexity.  Does pruning the tree improve the test MSE?
cv = cv.tree(tree)
plot(cv$size, cv$dev/nrow(Carseats), type="b", pch=19, col="#562435",
  xlab="Tree Size (# Terminal Nodes)", ylab="MSE",
  main="Cross Validation: Effect of Tree Size on MSE", ylim=c(2.6,2.8))

Cross validation indicates that the minimum MSE occurs when the tree size is 9 or 10.
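The minimising size can also be read directly from the cv object created above (a quick sketch):
cv$size[which.min(cv$dev)]   # tree size with the lowest cross-validated deviance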
pruned.tree = prune.tree(tree, best=9)
pruned.mse = calculate_mse(pruned.tree, test)
pruned.mse
[1] 4.993124
However, the above calculation shows that pruning the tree does not improve the MSE of the Test data set.  Let’s look at the effect of pruning on the test MSE:
# Test MSE for pruned trees of every size from 2 to 18 terminal nodes
pruned.test.mse = rep(NA, 18)
for (i in 2:18) {
  pruned.tree = prune.tree(tree, best=i)
  pruned.test.mse[i] = calculate_mse(pruned.tree, test)
}
plot(pruned.test.mse, type="b", pch=19, col="#562435", 
     xlab="Pruned Tree Size (# Terminal Nodes)", ylab="Test MSE", 
     main="Effect of Pruning on Test MSE", xlim=c(3,18), ylim=c(4,6))

The above chart shows that tree pruning actually increases the MSE of the Test data set.  The conflicting signals from the two data sets (cross-validation was done using the Training data set) are characteristic of the (single) Decision Tree method, which can have high variance.  Better approaches - such as Bagging, Boosting and Random Forests - address this problem by not relying upon a single tree and, instead, averaging the predictions of many trees.
  1. Use the bagging approach in order to analyze this data.  What test MSE do you obtain?  Use the importance() function to determine which variables are most important.
library(randomForest)
set.seed(3)
bagged.tree = randomForest(Sales ~ ., data=training, importance=TRUE, mtry=10)  # mtry = all 10 predictors => bagging
bagged.mse = calculate_mse(bagged.tree, test)
bagged.mse
[1] 2.55751
Using bagging has reduced the test MSE substantially.  With a single tree (of 18 leaves) our test MSE was 4.148897.  Now, with bagging, it has fallen 38.4% to 2.55751.
The downside of bagging is that there is no single tree with which we can identify the key predictor variables.  There are 500 trees, and each of our bagged predicted values is the average of 500 individual predictions from those trees.  Instead, for bagged results, we can use the importance() function to identify the key predictor variables:
importance(bagged.tree, type=1)
               %IncMSE
CompPrice   14.3725974
Income       5.9039670
Advertising 14.2546456
Population   0.5521805
Price       61.0815445
ShelveLoc   46.5308006
Age         20.9071380
Education    1.3675478
Urban       -2.7169457
US           6.8176330
We can now see the importance of each predictor.  %IncMSE reports the increase in out-of-bag MSE when the values of that predictor are randomly permuted; the larger the value, the more the model relies on that predictor and hence the greater its explanatory power.  Clearly, the most important variables are (in order of importance, greatest to least):
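The ranked listing below can be produced with the same stack/sort idiom used later for the random forest (the exact call is a reconstruction, as it was not shown in the original):
stack(sort(importance(bagged.tree, type=1)[,1], decreasing=TRUE))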
       values         ind
1  61.0815445       Price
2  46.5308006   ShelveLoc
3  20.9071380         Age
4  14.3725974   CompPrice
5  14.2546456 Advertising
6   6.8176330          US
7   5.9039670      Income
8   1.3675478   Education
9   0.5521805  Population
10 -2.7169457       Urban
  1. Use random forests to analyze this data.  What test MSE do you obtain?  Use the importance() function to determine which variables are most important.  Describe the effect of m, the number of variables considered at each split, on the test MSE obtained.
We can examine the effect on MSE of the above strong predictors (Price, ShelveLoc, etc.) by letting the algorithm randomly select a subset of the predictors to consider at each split of each tree - a process known as a Random Forest.
# Carseats has 10 predictors, so mtry ranges from 1 (most random) to 10 (bagging)
random.forest.mse = rep(0, 10)
for (i in 1:10) {
  set.seed(3)
  random.forest = randomForest(Sales ~ ., data=training, importance=TRUE, mtry=i)
  random.forest.mse[i] = calculate_mse(random.forest, test)
}
plot(random.forest.mse,
  type="b",
  pch=19,
  col="#562435",
  xlab="Number of Predictors (mtry)",
  ylab="MSE",
  main="Random Forest Results",
  ylim=c(2.5, 3.0)
)


set.seed(3)
random.forest.9 = randomForest(Sales ~ ., data=training, mtry=9, importance=TRUE)
random.forest.9.mse = calculate_mse(random.forest.9, test)
random.forest.9.mse
[1] 2.55751
stack(sort(importance(random.forest.9, type=1)[,1], decreasing=TRUE))
       values         ind
1  56.8881626       Price
2  43.7826458   ShelveLoc
3  21.1565415         Age
4  14.8210193 Advertising
5  13.9965137   CompPrice
6   5.9415088          US
7   5.2824015      Income
8   1.3512694   Education
9   0.5676259  Population
10 -2.9272455       Urban
From the above we can see:
  1. If we allow the random forest to choose from 9 predictors at each split, instead of all 10, we get essentially the same test MSE;
  2. If we limit the number of predictors below 9, the test MSE increases; and
  3. The importance of each predictor is roughly the same - although the Random Forest reports lower importance values for the strongest predictors and swaps the order slightly for some of them, eg. CompPrice and Advertising.

Question 3 - Classification Trees

This problem involves the OJ data set which is part of the ISLR package.
  1. Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations
library(ISLR)
OJ = na.omit(OJ)
n = nrow(OJ)
set.seed(1)
idx = sample(n, 800)
training = OJ[idx,]
test = OJ[-idx,]
  1. Fit a tree to the training data, with Purchase as the response and the other variables as predictors.  Use the summary() function to produce summary statistics about the tree, and describe the results obtained.  What is the training error rate?  How many terminal nodes does the tree have?
library(tree)
tree = tree(Purchase ~ ., data=training)
summary(tree)

Classification tree:
tree(formula = Purchase ~ ., data = training)
Variables actually used in tree construction:
[1] "LoyalCH"       "PriceDiff"     "SpecialCH"     "ListPriceDiff"
Number of terminal nodes:  8 
Residual mean deviance:  0.7305 = 578.6 / 792 
Misclassification error rate: 0.165 = 132 / 800 
Of the 17 predictors, only 4 are used: LoyalCH, PriceDiff, SpecialCH and ListPriceDiff.  These four give a tree with 8 terminal nodes and a training error rate of 16.5% (132 of the 800 training observations misclassified).
  1. Type in the name of the tree object in order to get a detailed text output.  Pick one of the terminal nodes, and interpret the information displayed.
tree
node), split, n, deviance, yval, (yprob)
      * denotes terminal node

 1) root 800 1064.00 CH ( 0.61750 0.38250 )  
   2) LoyalCH < 0.508643 350  409.30 MM ( 0.27143 0.72857 )  
     4) LoyalCH < 0.264232 166  122.10 MM ( 0.12048 0.87952 )  
       8) LoyalCH < 0.0356415 57   10.07 MM ( 0.01754 0.98246 ) *
       9) LoyalCH > 0.0356415 109  100.90 MM ( 0.17431 0.82569 ) *
     5) LoyalCH > 0.264232 184  248.80 MM ( 0.40761 0.59239 )  
      10) PriceDiff < 0.195 83   91.66 MM ( 0.24096 0.75904 )  
        20) SpecialCH < 0.5 70   60.89 MM ( 0.15714 0.84286 ) *
        21) SpecialCH > 0.5 13   16.05 CH ( 0.69231 0.30769 ) *
      11) PriceDiff > 0.195 101  139.20 CH ( 0.54455 0.45545 ) *
   3) LoyalCH > 0.508643 450  318.10 CH ( 0.88667 0.11333 )  
     6) LoyalCH < 0.764572 172  188.90 CH ( 0.76163 0.23837 )  
      12) ListPriceDiff < 0.235 70   95.61 CH ( 0.57143 0.42857 ) *
      13) ListPriceDiff > 0.235 102   69.76 CH ( 0.89216 0.10784 ) *
     7) LoyalCH > 0.764572 278   86.14 CH ( 0.96403 0.03597 ) *
Looking at the terminal node marked ‘7)’, we can tell that 278 training observations fall into this region, being the region of customers whose loyalty to the Citrus Hill brand (LoyalCH) is above 0.764572 (note the ‘yval’ is ‘CH’).  The class proportions show that roughly 96% of these customers purchased Citrus Hill, so the tree predicts, with high confidence, that a customer with LoyalCH above 0.764572 will purchase Citrus Hill Orange Juice.
  1. Create a plot of the tree, and interpret the results.
plot(tree, col="grey")
text(tree, pretty=0, col="red")

Clearly, the extent of loyalty to the Citrus Hill brand (“LoyalCH”) has a major impact on purchasing decisions.  If this loyalty is rated above 0.508643, then a purchase of Citrus Hill is predicted.  A purchase of Minute Maid is only predicted when either (a) loyalty to Citrus Hill is rated below 0.264232, or (b) loyalty is between 0.264232 and 0.508643, the price difference between the two products is less than 0.195, and Citrus Hill is not on special (SpecialCH < 0.5).

The cases where the terminal nodes under a branch have the same predicted value (ie. MM and MM, or CH and CH) illustrate node purity.  The more a terminal node is dominated by a single class, the ‘purer’ it is.  In effect, purity is the difference between an outcome being highly likely and merely probable: a prediction that terminates at a highly pure node (such as node 7) is very likely to match the actual outcome, whereas a prediction that falls into a less pure node (such as node 11) is correct only on the balance of probabilities, with a far from negligible chance that the actual outcome will be the opposite of that predicted.
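As a rough numeric illustration (a sketch using the yprob values printed in the tree output above), the Gini index 2p(1-p) quantifies this contrast between node 7 and node 11:
gini = function(p) 2 * p * (1 - p)
gini(0.96403)  # node 7: very low impurity, prediction highly reliable
gini(0.54455)  # node 11: close to maximal impurity, prediction much less certain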
  1. Predict the response on the test data, and produce a confusion matrix comparing the test labels to the predicted test labels.  What is the test error rate?
actual = test$Purchase
predicted = ifelse(predict(tree, newdata=test)[,1] <= 0.5, "MM", "CH")
confusion.matrix = table(actual, predicted)
confusion.matrix
      predicted
actual  CH  MM
    CH 147  12
    MM  49  62
test.error.rate = (confusion.matrix[1,2] + confusion.matrix[2,1]) / nrow(test)
test.error.rate
[1] 0.2259259
The test error rate is 22.6%.
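Equivalently, the test error rate can be computed directly from the vectors above (a minimal check):
mean(predicted != actual)   # proportion of misclassified test observations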
  1. Apply the cv.tree() function to the training set in order to determine the optimal tree size.
cv = cv.tree(tree, FUN=prune.misclass)
cv
$size
[1] 8 5 2 1

$dev
[1] 156 156 156 306

$k
[1]       -Inf   0.000000   4.666667 160.000000

$method
[1] "misclass"

attr(,"class")
[1] "prune"         "tree.sequence"
  1. Produce a plot with tree size on the x-axis and cross-validated classification error rate on the y-axis.
plot(cv$size, (cv$dev / nrow(training) * 100),
  type="b",
  pch=19,
  col="#562435",
  xlab="Tree Size (# Terminal Nodes)",
  ylab="Error Rate (%)",
  main="Cross-Validation Classification Error Rates for Different Sized Trees"
)

  1. Which tree size corresponds to the lowest cross-validated classification error rate?
Tree sizes 8, 5 and 2 all attain the minimum cross-validated error (156 misclassifications), so the smallest of these - the tree with 2 terminal nodes - corresponds to the lowest cross-validated classification error rate.
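This can be confirmed programmatically from the cv object (a short sketch):
cv$size[cv$dev == min(cv$dev)]        # all tree sizes attaining the minimum CV error
min(cv$size[cv$dev == min(cv$dev)])   # the smallest such tree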
  1. Produce a pruned tree corresponding to the optimal tree size obtained using cross-validation.  If cross-validation does not lead to selection of a pruned tree, then create a pruned tree with five terminal nodes.
pruned.tree = prune.misclass(tree, best=2)
summary(pruned.tree)

Classification tree:
snip.tree(tree = tree, nodes = c(3L, 2L))
Variables actually used in tree construction:
[1] "LoyalCH"
Number of terminal nodes:  2 
Residual mean deviance:  0.9115 = 727.4 / 798 
Misclassification error rate: 0.1825 = 146 / 800 
  1. Compare the training error rates between the pruned and unpruned trees.  Which is higher?
The training error rate for the pruned tree (18.25%) is slightly higher than that of the unpruned tree (16.5%).
  1. Compare the test error rates between the pruned and unpruned trees.  Which is higher?
actual = test$Purchase
predicted = ifelse(predict(pruned.tree, newdata=test)[,1] <= 0.5, "MM", "CH")
confusion.matrix = table(actual, predicted)
confusion.matrix
      predicted
actual  CH  MM
    CH 119  40
    MM  30  81
pruned.test.error.rate = (confusion.matrix[1,2] + confusion.matrix[2,1]) / nrow(test)
pruned.test.error.rate
[1] 0.2592593
The test error rate for the pruned tree (25.9%) is higher than the unpruned tree (22.6%).

Question 4 - SVM

In this problem, you will use support vector approaches in order to predict whether a given car gets high or low gas mileage based on the Auto data set.
  1. Create a binary variable that takes on a 1 for cars with gas mileage above the median, and a 0 for cars with gas mileage below the median.
library(e1071)
library(ISLR)
Auto = na.omit(Auto)
mpghigh = ifelse(Auto$mpg > median(Auto$mpg), 1, 0)
Auto$mpghigh = as.factor(mpghigh)
  1. Fit a support vector classifier to the data with various values of cost, in order to predict whether a car gets high or low gas mileage.  Report the cross-validation errors associated with different values of this parameter.  Comment on your results.
set.seed(1)
tune.linear = tune(svm, mpghigh~., data=Auto, kernel="linear", ranges=list(cost=c(0.01, 0.1, 1, 10, 100)))
linear = summary(tune.linear)
linear

Parameter tuning of 'svm':

- sampling method: 10-fold cross validation 

- best parameters:
 cost
    1

- best performance: 0.01275641 

- Detailed performance results:
   cost      error dispersion
1 1e-02 0.07403846 0.05471525
2 1e-01 0.03826923 0.05148114
3 1e+00 0.01275641 0.01344780
4 1e+01 0.02038462 0.01074682
5 1e+02 0.03820513 0.01773427
From the above output we can see the support vector classifier with the lowest cross-validation error is the one with cost = 1, having an error rate of 1.276%.
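The model refit at this best cost is stored in the tune object and can be reused directly (a short sketch of the e1071 tune API):
best.linear = tune.linear$best.model   # svm refit on the full data set with the best cost (1)
summary(best.linear)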
  1. Now repeat (b), this time using SVMs with radial and polynomial basis kernels, with different values of gamma and degree and cost.  Comment on your results.
Polynomial SVM
set.seed(1)
tune.poly = tune(svm, mpghigh~., data=Auto, kernel="polynomial", ranges=list(cost=c(0.01, 0.1, 1, 10, 100), degree=c(2,3,4)))
poly = summary(tune.poly)
poly

Parameter tuning of 'svm':

- sampling method: 10-fold cross validation 

- best parameters:
 cost degree
  100      2

- best performance: 0.3013462 

- Detailed performance results:
    cost degree     error dispersion
1  1e-02      2 0.5611538 0.04344202
2  1e-01      2 0.5611538 0.04344202
3  1e+00      2 0.5611538 0.04344202
4  1e+01      2 0.5382051 0.05829238
5  1e+02      2 0.3013462 0.09040277
6  1e-02      3 0.5611538 0.04344202
7  1e-01      3 0.5611538 0.04344202
8  1e+00      3 0.5611538 0.04344202
9  1e+01      3 0.5611538 0.04344202
10 1e+02      3 0.3322436 0.11140578
11 1e-02      4 0.5611538 0.04344202
12 1e-01      4 0.5611538 0.04344202
13 1e+00      4 0.5611538 0.04344202
14 1e+01      4 0.5611538 0.04344202
15 1e+02      4 0.5611538 0.04344202

The above results show the lowest cross-validation error occurs at a cost of 100 with a polynomial of degree 2.  However, compared to the results in (b), the cross-validation error of this ‘best’ polynomial SVM is approx. 24x that of the linear SVM (30.135% vs. 1.276%).

Radial SVM
set.seed(1)
tune.radial = tune(svm, mpghigh~., data=Auto, kernel="radial", ranges=list(cost=c(0.01, 0.1, 1, 10, 100), gamma=c(0.01,0.1,1,10)))
radial = summary(tune.radial)
radial

Parameter tuning of 'svm':

- sampling method: 10-fold cross validation 

- best parameters:
 cost gamma
  100  0.01

- best performance: 0.01532051 

- Detailed performance results:
    cost gamma      error dispersion
1  1e-02  0.01 0.56115385 0.04344202
2  1e-01  0.01 0.09185897 0.03862507
3  1e+00  0.01 0.07147436 0.05103685
4  1e+01  0.01 0.02551282 0.03812986
5  1e+02  0.01 0.01532051 0.01788871
6  1e-02  0.10 0.19153846 0.07612945
7  1e-01  0.10 0.07916667 0.05201159
8  1e+00  0.10 0.05608974 0.05092939
9  1e+01  0.10 0.02551282 0.02076457
10 1e+02  0.10 0.02807692 0.01458261
11 1e-02  1.00 0.56115385 0.04344202
12 1e-01  1.00 0.56115385 0.04344202
13 1e+00  1.00 0.06634615 0.06187383
14 1e+01  1.00 0.06128205 0.06186124
15 1e+02  1.00 0.06128205 0.06186124
16 1e-02 10.00 0.56115385 0.04344202
17 1e-01 10.00 0.56115385 0.04344202
18 1e+00 10.00 0.51775641 0.04471079
19 1e+01 10.00 0.51012821 0.03817175
20 1e+02 10.00 0.51012821 0.03817175
For the radial SVM, the lowest cross-validation error comes at a cost of 100 with a gamma of 0.01.  This ‘best’ radial SVM has a cross-validation error only slightly higher than that of the linear SVM in (b) (1.532% vs. 1.276%).
The above comparison becomes somewhat easier if we plot the results:
plot(linear$performances[,c("cost","error")], type="l", ylim=c(0,0.6), main="CV results for different SVMs")
lines(poly$performances[poly$performances["degree"]==2, c("cost", "error")], col=2)
lines(poly$performances[poly$performances["degree"]==3, c("cost", "error")], col=3)
lines(poly$performances[poly$performances["degree"]==4, c("cost", "error")], col=4)
lines(radial$performances[radial$performances["gamma"]==0.01, c("cost", "error")], col=5)
lines(radial$performances[radial$performances["gamma"]==0.1 , c("cost", "error")], col=6)
lines(radial$performances[radial$performances["gamma"]==1   , c("cost", "error")], col=7)
lines(radial$performances[radial$performances["gamma"]==10  , c("cost", "error")], col=8)
legend(5, 0.5,
  c("Linear", "Polynomial (degree=2)", "Polynomial (degree=3)", "Polynomial (degree=4)", "Radial (gamma=0.01)", "Radial (gamma=0.1)", "Radial (gamma=1)", "Radial (gamma=10)"),
  lty=rep("solid", 8),
  col=1:8,
  bty="n"
)

Here we can see that the Linear classifier gives the best cross-validation performance: it offers the lowest error rate with the least cost.  We can also see that we get similar performance (to the Linear case) from low-gamma Radial classifiers, if we are willing to accept greater cost.  As the value of cost increases, these low-gamma classifiers outperform the Linear classifier.  Although it is difficult to see from the above chart, the gamma=0.01 Radial classifier starts to outperform the Linear classifier when cost is above 30 (approx.) and the gamma=0.1 Radial classifier does better than the Linear classifier when cost is above 40 (approx.).  Finally, we can see that the Polynomial classifiers are not very useful as they require very high values of cost to get their performance anywhere near the Linear classifier.
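A compact numeric comparison of the three kernels can also be pulled straight from the tune objects above (a quick sketch):
data.frame(kernel = c("linear", "polynomial", "radial"),
           best.cv.error = c(tune.linear$best.performance,
                             tune.poly$best.performance,
                             tune.radial$best.performance))   # lowest CV error achieved by each kernel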
  1. Make some plots to back up your assertions in (b) and (c).

    Hint: In the lab, we used the plot() function for svm objects only in cases with \(p\) = 2.  When \(p\) > 2 you can use the plot() function to create plots displaying pairs of variables at a time.  Essentially, instead of typing plot(svmfit, dat) where svmfit contains your fitted model and dat is a data frame containing your data, you can type plot(svmfit, dat, x1 ~ x4) in order to plot just the first and fourth variables.  However, you must replace x1 and x4 with the correct variable names.  To find out more type ?plot.svm.
The below charts illustrate the ‘best performance’ case for each of the Linear, Radial and Polynomial SVMs as determined in (b) and (c) above.  As there are more than two predictor variables (horsepower, acceleration, etc.), visualising each SVM in a single chart is not possible; instead, a series of 2-dimensional charts has been plotted.
The background of each chart is shaded to highlight the position of the decision boundary.  The decision boundary is clearly visible in all of the Linear charts, whereas the Polynomial charts fail to show this demarcation, confirming the earlier conclusion that the Polynomial SVM is not a good model for the Auto data.  Some of the Radial charts do show the decision boundary and it is non-linear, although its shape in the charts for displacement and weight suggests the non-linearity is approaching linearity; the Radial decision boundary is not visible in the chart for horsepower.  This combination of charts supports the conclusion made in (c) above, that the Radial model is better than the Polynomial, but still inferior to the Linear.
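The charts under the headings below can be reproduced with plot.svm() on each tuned model; calls of roughly the following form were used (the particular variable pairs shown here are illustrative assumptions, not necessarily those chosen for the original charts):
plot(tune.linear$best.model, Auto, weight ~ displacement)
plot(tune.linear$best.model, Auto, acceleration ~ horsepower)
plot(tune.poly$best.model,   Auto, weight ~ displacement)
plot(tune.radial$best.model, Auto, weight ~ displacement)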
Linear Charts

Polynomial Charts

Radial Charts

Question 5 - SVM

Here we explore the maximal margin classifier on a toy data set.
  1. We are given \(n\) = 7 observations in \(p\) = 2 dimensions.  For each observation, there is an associated class label.  Sketch the observations.
X1 = c(3,2,4,1,2,4,4)
X2 = c(4,2,4,4,1,3,1)
Y = c("Red", "Red", "Red", "Red", "Blue", "Blue", "Blue")
plot(X1, X2, col=Y, pch=19)

  1. Sketch the optimal separating hyperplane and provide the equation for this hyperplane of the following form.
\(\beta_0 + \beta_1 X_1 + \beta_2 X_2 = 0\)


df = data.frame(X1, X2, y=as.factor(Y))
library(e1071)
mmc.fit = svm(y~., data=df, scale=FALSE, kernel="linear", cost=10)
beta = round(t(mmc.fit$coefs) %*% as.matrix(df[mmc.fit$index,1:2]),3)  # (beta1, beta2) recovered from the support vectors
beta0 = round(mmc.fit$rho, 3)                                          # offset term (rho)
mmc.intercept = beta0 / beta[2]     # intercept of the line in (X1, X2) space
mmc.gradient = -beta[1] / beta[2]   # slope of the line in (X1, X2) space
plot(X1, X2, col=Y, pch=19)
abline(mmc.intercept, mmc.gradient, lty=5)
legend(1, 4,
  c("Maximal Margin Hyperplane"), 
  lty=c("dashed"),
  lwd=c(2),
  bty="n"
)

The hyperplane has the equation:


\(0.5 - X_1 + X_2 = 0\)


which, rearranging, gives us the equation in (\(X_1,X_2\)) space:


\(X_2 = X_1 -0.5\)


  1. Describe the classification rule for the maximal margin classifier.  It should be something along the lines of “Classify to Red if \(\beta_0 + \beta_1 X_1 + \beta_2 X_2 \gt 0\) and classify to Blue otherwise.”  Provide the values for \(\beta_0\), \(\beta_1\) and \(\beta_2\).
Classify as Red if \(0.5 - X_1 + X_2 > 0\) and Blue otherwise, ie. \(\beta_0 = 0.5\), \(\beta_1 = -1\) and \(\beta_2 = 1\).
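Applying this rule to the seven training observations reproduces every label (a quick check using the vectors defined in (a)):
rule = ifelse(0.5 - X1 + X2 > 0, "Red", "Blue")
all(rule == Y)   # TRUE - every point is classified correctly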
  1. On your sketch indicate the margin for the maximal margin hyperplane.
plot(X1, X2, col=Y, pch=19)
abline(mmc.intercept, mmc.gradient, col="lightgrey")
abline((beta0-1)/beta[2], mmc.gradient, lty=5)
abline((beta0+1)/beta[2], mmc.gradient, lty=5)
legend(1, 4,
  c("Maximal Margin Hyperplane", "Margin"), 
  lty=c("solid", "dashed"),
  lwd=c(1, 2),
  col=c("lightgrey", "black"),
  bty="n"
)

  1. Indicate the support vectors for the maximal margin classifier.
plot(X1, X2, col=Y, pch=19)
abline(mmc.intercept, mmc.gradient, col="lightgrey")
abline((beta0-1)/beta[2], mmc.gradient, lty=5, col="lightgrey")
abline((beta0+1)/beta[2], mmc.gradient, lty=5, col="lightgrey")
points(df[mmc.fit$index,], pch=1, cex=3, col=df[mmc.fit$index,3])
legend(1, 4,
  c("Maximal Margin Hyperplane", "Margin", "Support Vector"), 
  lty=c("solid", "dashed", NA),
  lwd=c(1, 1, 1),
  pch=c(NA,NA,1),
  col=c("lightgrey", "lightgrey", "black"),
  bty="n"
)

The support vectors are points 2, 3, 6:
df[mmc.fit$index,1:2]
  X1 X2
2  2  2
3  4  4
6  4  3
  1. Argue that a slight movement of the seventh observation would not affect the maximal margin hyperplane.
The 7th observation (4,1) is not a support vector and lies well away from both the hyperplane and its margin.  Consequently, any slight movement in its position will not affect the hyperplane (nor its margin).
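We can confirm this by nudging observation 7 slightly and refitting (a sketch; the size of the nudge is arbitrary, provided the point stays outside the margin):
df.moved = df
df.moved[7, "X1"] = df.moved[7, "X1"] + 0.2   # small move of the 7th point
df.moved[7, "X2"] = df.moved[7, "X2"] - 0.2
moved.fit = svm(y ~ ., data=df.moved, scale=FALSE, kernel="linear", cost=10)
round(t(moved.fit$coefs) %*% as.matrix(df.moved[moved.fit$index, 1:2]), 3)  # unchanged beta
round(moved.fit$rho, 3)                                                     # unchanged beta0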
  1. Sketch a hyperplane that is not the optimal separating hyperplane and provide the equation for this hyperplane.
plot(X1, X2, col=Y, pch=19)
abline(mmc.intercept, mmc.gradient, col="lightgrey")
abline((beta0-1)/beta[2], mmc.gradient, lty=5, col="lightgrey")
abline((beta0+1)/beta[2], mmc.gradient, lty=5, col="lightgrey")
points(df[mmc.fit$index,], pch=1, cex=3, col="lightgrey")
so.mmc.fit = svm(y~., data=df, scale=FALSE, kernel="linear", cost=1)
so.beta = round(t(so.mmc.fit$coefs) %*% as.matrix(df[so.mmc.fit$index,1:2]),3)
so.beta0 = round(so.mmc.fit$rho, 3)
so.mmc.intercept = round(so.beta0   / so.beta[2], 3)
so.mmc.gradient = round(-so.beta[1] / so.beta[2], 3)
abline(so.mmc.intercept, so.mmc.gradient)
abline((so.beta0-1)/so.beta[2], so.mmc.gradient, lty=5)
abline((so.beta0+1)/so.beta[2], so.mmc.gradient, lty=5)
points(df[so.mmc.fit$index,], pch=1, cex=3, col=df[so.mmc.fit$index,3])
legend(1, 4,
  c("Maximal Margin Hyperplane", "Margin", "Support Vector", "Sub-optimal Maximal Margin Hyperplane", "Sub-optimal Margin", "Sub-optimal Support Vector"), 
  lty=c("solid", "dashed", NA, "solid", "dashed", NA),
  lwd=c(1, 1, NA, 1, 1, NA),
  pch=c(NA, NA, 1, NA, NA, 1),
  col=c("lightgrey", "lightgrey", "lightgrey", "black", "black", "black"),
  bty="n"
)

The above hyperplane is sub-optimal because (a) it relies on more support vectors (4 vs. 3 previously), and (b) its margin is wide enough that some observations now fall inside it.

The sub-optimal hyperplane has the equation:


\(-0.846 - 0.615X_1 + 1.077X_2 = 0\)


which, rearranging, gives us the equation in (\(X_1,X_2\)) space:


\(X_2 = 0.571X_1 + 0.786\)

  1. Draw an additional observation on the plot so that the two classes are no longer separable by a hyperplane.
plot(X1, X2, col=Y, pch=19)
points(4, 2, pch=19, col="red")
points(4, 2, pch=1, cex=3, col="red")

Here we can see that the new observation (4,2) results in the inability to draw a hyperplane that separates the reds from the blues.

Question 6 - Hierarchical Clustering

Consider the USArrests data.  We will now perform hierarchical clustering on the states.
  1. Using hierarchical clustering with complete linkage and Euclidean distance, cluster the states.
library(ISLR)
set.seed(1)
hc.complete = hclust(dist(USArrests), method="complete")
plot(hc.complete)

  1. Cut the dendrogram at a height that results in three distinct clusters.  Which states belong to which clusters?
hc.cut = cutree(hc.complete, 3)

Cluster #1:

subset(names(hc.cut), hc.cut==1)
 [1] "Alabama"        "Alaska"         "Arizona"        "California"    
 [5] "Delaware"       "Florida"        "Illinois"       "Louisiana"     
 [9] "Maryland"       "Michigan"       "Mississippi"    "Nevada"        
[13] "New Mexico"     "New York"       "North Carolina" "South Carolina"

Cluster #2:

subset(names(hc.cut), hc.cut==2)
 [1] "Arkansas"      "Colorado"      "Georgia"       "Massachusetts"
 [5] "Missouri"      "New Jersey"    "Oklahoma"      "Oregon"       
 [9] "Rhode Island"  "Tennessee"     "Texas"         "Virginia"     
[13] "Washington"    "Wyoming"      

Cluster #3:

subset(names(hc.cut), hc.cut==3)
 [1] "Connecticut"   "Hawaii"        "Idaho"         "Indiana"      
 [5] "Iowa"          "Kansas"        "Kentucky"      "Maine"        
 [9] "Minnesota"     "Montana"       "Nebraska"      "New Hampshire"
[13] "North Dakota"  "Ohio"          "Pennsylvania"  "South Dakota" 
[17] "Utah"          "Vermont"       "West Virginia" "Wisconsin"    
  1. Hierarchically cluster the states using complete linkage and Euclidean distance, after scaling the variables to have standard deviation one.
hc.complete.scaled = hclust(dist(scale(USArrests)), method="complete")
plot(hc.complete.scaled)

hc.cut.scaled = cutree(hc.complete.scaled, 3)

Cluster #1:

subset(names(hc.cut.scaled), hc.cut.scaled==1)
[1] "Alabama"        "Alaska"         "Georgia"        "Louisiana"     
[5] "Mississippi"    "North Carolina" "South Carolina" "Tennessee"     

Cluster #2:

subset(names(hc.cut.scaled), hc.cut.scaled==2)
 [1] "Arizona"    "California" "Colorado"   "Florida"    "Illinois"  
 [6] "Maryland"   "Michigan"   "Nevada"     "New Mexico" "New York"  
[11] "Texas"     

Cluster #3:

subset(names(hc.cut.scaled), hc.cut.scaled==3)
 [1] "Arkansas"      "Connecticut"   "Delaware"      "Hawaii"       
 [5] "Idaho"         "Indiana"       "Iowa"          "Kansas"       
 [9] "Kentucky"      "Maine"         "Massachusetts" "Minnesota"    
[13] "Missouri"      "Montana"       "Nebraska"      "New Hampshire"
[17] "New Jersey"    "North Dakota"  "Ohio"          "Oklahoma"     
[21] "Oregon"        "Pennsylvania"  "Rhode Island"  "South Dakota" 
[25] "Utah"          "Vermont"       "Virginia"      "Washington"   
[29] "West Virginia" "Wisconsin"     "Wyoming"      
  1. What effect does scaling the variables have on the hierarchical clustering obtained?  In your opinion, should the variables be scaled before the inter-observation dissimilarities are computed?  Provide a justification for your answer.
We can see the effect that scaling has on the clustering by looking at the Confusion Matrix (scaled vs. non-scaled):
table(hc.cut.scaled, hc.cut)
             hc.cut
hc.cut.scaled  1  2  3
            1  6  2  0
            2  9  2  0
            3  1 10 20
Scaling the data before clustering has resulted in the reclassification of some of the US States into a different cluster than that which was obtained using the raw, unscaled data.  For example, California was previously in Cluster #1 but is now in Cluster #2 and Georgia was previously in Cluster #2 but is now in Cluster #1.
Scaling data removes the effect that differences in measurement units can have on the dissimilarity measure used.  By transforming each variable from its raw value to the number of standard deviations that value lies from the mean of that variable, the scale() function puts every variable on the same scale, so that the distance measure (eg. Euclidean distance) is not unduly influenced by variables that happen to have a different scale.  In other words, after scaling, each variable is given equal importance in the clustering algorithm.  For example, if the distance on one dimension between two data points is measured in metres and that distance is 100m, while all of the other variables are measured in kilometres and their distances range between 0.1km and 1.0km, then the variable measured in metres will have undue prominence in the Euclidean distance calculation and the clustering algorithm will tend towards forming clusters based on that ‘metre’ predictor.  This may, or may not, be desirable.
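A toy numeric illustration of this point (a sketch; the variable names metres, km1 and km2 are hypothetical):
a = c(metres = 100, km1 = 0.2, km2 = 0.9)   # one observation, mixed units
b = c(metres =   0, km1 = 0.8, km2 = 0.1)   # another observation
x.raw = rbind(a, b)
dist(x.raw)          # ~100: dominated entirely by the 'metres' variable
dist(scale(x.raw))   # after scaling, every variable contributes equally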
The question of whether or not scaling is desirable, therefore, depends on the underlying data and the nature of the question being asked of that data.  For example, in the case of retail shopping data where the data set consists of a large number of low-value items and a small number of high-value items, without scaling, the high-volume/low-value transactions may take greater importance in clustering shoppers.  This unscaled data would be aptly suited to clustering individuals into High, Medium and Low item-count shoppers; however, it would be less suitable if the aim was to cluster individuals by the value of purchases made.  In the latter case, scaling the data would remove the high-volume/low-value bias in the raw data.
In the case of the USArrests data set, the variables are measured in different units: Murder, Rape and Assault are measured as the number of occurrences per 100,000 people, while UrbanPop is measured as the percentage of a State’s population that lives in Urban (as opposed to Rural) areas.  The individual variances of these raw variables are:
apply(USArrests, 2, var)
    Murder    Assault   UrbanPop       Rape 
  18.97047 6945.16571  209.51878   87.72916 
As Assault has a very large variance compared to the others, our clustering algorithm (which uses Euclidean distance and Complete linkage) will give preference to this variable, and our clusters will be biased towards grouping States with High, Medium and Low Assault crime rates.  If we scale the variables, however:
apply(scale(USArrests), 2, var)
  Murder  Assault UrbanPop     Rape 
       1        1        1        1 
then each variable has equal importance.  Thus, if we are interested in separating States into clusters of High, Medium and Low overall crime rates, then scaling the data is desirable.

Question 7 - PCA and K-Means Clustering

In this problem, you will generate simulated data and then perform PCA and K-Means Clustering on the data.
  1. Generate a simulated data set with 20 observations in each of three classes (ie. 60 observations total) and 50 variables.

    Hint: There are a number of functions in R that you can use to generate data.  One example is the rnorm() function; runif() is another option.  Be sure to add a mean shift to the observations in each class so that there are three distinct classes.
set.seed(1)
x1 = matrix(rnorm(20 * 50, mean=0), ncol=50)
set.seed(2)
x2 = matrix(rnorm(20 * 50, mean=1), ncol=50)
set.seed(3)
x3 = matrix(rnorm(20 * 50, mean=2), ncol=50)
x = rbind(x1, x2, x3)
sample.id = c(rep(1, 20), rep(2, 20), rep(3, 20))
plot(x, col=sample.id, pch=20)

The above plot of the first two (out of 50) variables shows the three samples co-mingling to a large extent: colour-coding is needed to clearly identify them, even though each sample is centred around a different mean.
  1. Perform PCA on the 60 observations and plot the first two principal components’ eigenvector.  Use a different colour to indicate the observations in each of the three classes.  If the three classes appear separated in this plot, then continue on to part (c).  If not, then return to part (a) and modify the simulation so that there is greater separation between the three classes.  Do not continue to part (c) until the three classes show at least some separation in the first two principal component eigenvectors.
plot(prcomp(x)$x[,1:2], pch=20, col=sample.id)

Using the first two principal components (as opposed to the first two raw variables plotted in (a) above) provides a much clearer separation of the three samples.
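We can also check how much of the total variance the first two components capture (a quick sketch):
pca = prcomp(x)
summary(pca)$importance[, 1:2]   # sd, proportion and cumulative proportion of variance for PC1 and PC2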
  1. Perform K-Means Clustering of the observations with K = 3.  How well do the clusters that you obtained in K-Means Clustering compare to the true class labels?

    Hint:You can use the table() function in R to compare the true class labels to the class labels obtained by clustering.  Be careful how you interpret the results: K-Means Clustering will arbitrarily number the clusters, so you cannot simply check whether the true class labels and clustering labels are the same.
cluster = kmeans(x, 3, nstart=20)$cluster
table(cluster, sample.id)
       sample.id
cluster  1  2  3
      1  0  0 20
      2  0 20  0
      3 20  0  0
The above table shows a perfect clustering of all three samples.  Each sample falls into its own, separate cluster.
  1. Perform K-Means Clustering with K = 2.  Describe your results.
cluster = kmeans(x, 2, nstart=20)$cluster
table(cluster, sample.id)
       sample.id
cluster  1  2  3
      1 20  0  0
      2  0 20 20
With K=2, both Sample IDs 2 and 3 are clustered together.
  1. Now perform K-Means Clustering with K = 4 and describe your results.
cluster = kmeans(x, 4, nstart=20)$cluster
table(cluster, sample.id)
       sample.id
cluster  1  2  3
      1  9  0  0
      2  0 20  0
      3 11  0  0
      4  0  0 20
With K=4, Sample ID 1 has been split into two separate clusters.
  1. Now perform K-Means Clustering with K = 3 on the first two principal components rather than on the raw data.  That is, perform K-Means Clustering on the 60 x 2 matrix of which the first column is the first principal component’s corresponding eigenvector and the second column is the second principal component’s corresponding eigenvector.  Comment on the results.
cluster = kmeans(prcomp(x)$x[,1:2], 3, nstart=20)$cluster
table(cluster, sample.id)
       sample.id
cluster  1  2  3
      1 20  0  0
      2  0 20  0
      3  0  0 20
Using the Principal Components, we get the same result as in (c) above, ie. perfect clustering.
  1. Using the scale() function perform K-Means Clustering with K = 3 on the data after scaling each variable to have standard deviation of one.  How do these results compare to those obtained in (b)?  Explain.
cluster = kmeans(scale(x), 3, nstart=20)$cluster
table(cluster, sample.id)
       sample.id
cluster  1  2  3
      1  0 20  0
      2  0  0 20
      3 20  0  0
Scaling has had no effect on the K=3 case: all samples are clustered correctly, which is the expected result.  The scale() function subtracts the mean of each column from every value in that column in order to re-centre the data, and then divides each resulting value by the standard deviation of that column, so that each column ends up with a standard deviation of 1.  The data set constructed in (a) has, within each column, three subsets of observations with the same standard deviation but different means.  Subtracting the overall column mean does not change the differences between these subset means, and dividing by the column standard deviation only rescales them.  Consequently, there are still sufficient differences between the subsets for the K-Means algorithm to place the same observations into the same clusters, whether or not the data has been scaled.
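A small check illustrates this (a sketch using the first column of x): scaling divides the gaps between the three class means by the column's standard deviation but preserves their sign and ordering, so the class structure survives.
tapply(x[, 1], sample.id, mean)          # class means in column 1 of the raw data
tapply(scale(x)[, 1], sample.id, mean)   # class means after centring and scaling - still well separated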