Class 4
February 12th, 2020
Dataset: n points in a d-dimensional space:
essentially, an n × d matrix of floats
For example:
  | X1 | X2 | … | Xd |
---|---|---|---|---|
x1 | x11 | x12 | … | x1d |
x2 | x21 | x22 | … | x2d |
… | … | … | … | … |
xn | xn1 | xn2 | … | xnd |
visualization is hard: we need a projection. Which one?
decision-making is impaired by the need to choose which dimensions to operate on
sensitivity analysis or causal analysis: which dimensions affect the others?
At higher dims., Geometry is not what we experience in d=3.
Highly counter-intuitive properties appear.
Adding dimensions might seem to increase sparsity.
this might sound good for a clean-cut segmentation of the data
In high dimensions, all points tend to be at roughly the same distance from each other.
Experiment: generate a set of random points in the unit hypercube [0,1]^d and compare their pairwise distances as d grows.
bye bye, k-NN
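A minimal sketch of the experiment above (assuming NumPy, uniform points in [0,1]^d, and Euclidean distances):

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_spread(n_points, d):
    """Ratio of the farthest to the nearest pairwise distance for random points in [0,1]^d."""
    X = rng.random((n_points, d))
    diffs = X[:, None, :] - X[None, :, :]            # all pairwise differences
    dists = np.sqrt((diffs ** 2).sum(axis=-1))       # Euclidean distances
    upper = dists[np.triu_indices(n_points, k=1)]    # each unordered pair once
    return upper.max() / upper.min()

for d in (2, 10, 100, 1000):
    print(d, round(distance_spread(100, d), 2))
# The ratio shrinks towards 1 as d grows: nearest and farthest neighbours
# become almost indistinguishable, so k-NN loses its discriminating power.
```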
In high dimensions, all diagonals strangely become (nearly) orthogonal to the axes:
a set of points nicely distributed along a diagonal appears compressed around the origin of the axes.
bye bye, Cosine Similarity
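A quick numeric check (a sketch: the cosine between the main diagonal (1, …, 1) and any coordinate axis is 1/√d, which vanishes as d grows):

```python
import numpy as np

for d in (2, 3, 10, 100, 10000):
    diagonal = np.ones(d)                 # the (1, 1, ..., 1) diagonal direction
    axis = np.zeros(d); axis[0] = 1.0     # a coordinate axis, e_1
    cos = diagonal @ axis / (np.linalg.norm(diagonal) * np.linalg.norm(axis))
    print(d, round(cos, 4))               # cos = 1/sqrt(d): tends to 0, i.e. orthogonal
```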
PCA: a technique from Geometry.
It seeks a lower-dimensional representation of the data.
Idea: rebase the data space so as to minimize the covariances between the new axes.
SVD: a technique from Spectral analysis.
Last week we saw the basics: extend eigen-decomposition to arbitrary matrices.
Singular values uncover categories and their strengths.
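A minimal sketch of this idea with NumPy's np.linalg.svd; the toy matrix and its two "categories" are invented purely for illustration:

```python
import numpy as np

# Toy ratings matrix: the first three rows follow one pattern, the last two another.
X = np.array([[5, 5, 0, 0],
              [4, 5, 0, 1],
              [5, 4, 0, 0],
              [0, 0, 5, 4],
              [1, 0, 4, 5]], dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
print(np.round(s, 2))   # two dominant singular values -> two underlying categories;
                        # their magnitude measures the strength of each category
```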
LDA: trained (supervised) classification, like Decision Trees.
When data is 2-dimensional, e.g., Iris, LDA seeks a vector onto which to project the points.
Strategy: find the direction over which the means of each class are projected as far from each other as possible.
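A minimal sketch of this strategy for two classes (Fisher's criterion computed with NumPy; the toy data are illustrative, not the lecture's example):

```python
import numpy as np

rng = np.random.default_rng(1)
# Two toy 2-D classes with different means (made-up data)
A = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
B = rng.normal(loc=[2.0, 1.0], scale=0.5, size=(50, 2))

mu_A, mu_B = A.mean(axis=0), B.mean(axis=0)
# Within-class scatter matrix (sum of the two class scatters)
S_w = np.cov(A, rowvar=False) * (len(A) - 1) + np.cov(B, rowvar=False) * (len(B) - 1)

# Fisher's direction w ∝ S_w^{-1} (mu_A - mu_B): projecting onto w separates
# the projected class means as much as possible, relative to the within-class spread.
w = np.linalg.solve(S_w, mu_A - mu_B)
w /= np.linalg.norm(w)

separation = abs((mu_A - mu_B) @ w)     # distance between the projected means
print(np.round(w, 3), round(separation, 3))
```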
Adding dimensions makes points seem further apart:
Name | Type | Degrees |
---|---|---|
Chianti | Red | 12.5 |
Grenache | Rosé | 12 |
Bordeaux | Red | 12.5 |
Cannonau | Red | 13.5 |
d(Chianti, Bordeaux) = 0: on Type and Degrees they are indistinguishable.
d(Chianti, Grenache) is small: they differ in Type and by only 0.5 degrees.
Adding dimensions makes points seem further from each other:
Name | Type | Degrees | Grape | Year |
---|---|---|---|---|
Chianti | Red | 12.5 | Sangiovese | 2016 |
Grenache | Rosé | 12 | Grenache | 2011 |
Bordeaux | Red | 12.5 | | 2009 |
Cannonau | Red | 13.5 | Grenache | 2015 |
d(Chianti, Bordeaux) > 7: the Grapes differ and the Years are 7 apart.
d(Chianti, Grenache) > 5: Type and Grape differ, and the Years are 5 apart.
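A sketch of the effect on the wine tables, assuming a naive mixed distance (absolute difference on numbers, 0/1 mismatch on everything else, Bordeaux's missing Grape counted as a mismatch); the encoding is only illustrative:

```python
def wine_distance(a, b):
    """Naive mixed distance: |difference| for numbers, 0/1 mismatch for everything else."""
    total = 0.0
    for va, vb in zip(a, b):
        if isinstance(va, (int, float)) and isinstance(vb, (int, float)):
            total += abs(va - vb)
        else:
            total += 0.0 if va == vb else 1.0
    return total

# (Type, Degrees)            and           (Type, Degrees, Grape, Year)
chianti_2d,  chianti_4d  = ("Red", 12.5), ("Red", 12.5, "Sangiovese", 2016)
bordeaux_2d, bordeaux_4d = ("Red", 12.5), ("Red", 12.5, None, 2009)   # Grape unknown

print(wine_distance(chianti_2d, bordeaux_2d))   # 0.0 -> indistinguishable on two attributes
print(wine_distance(chianti_4d, bordeaux_4d))   # 8.0 -> Grape mismatch + 7 years apart (> 7)
```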
For d=3, the volume of a ball of radius r is (4/3)πr^3.
With 50% of the radius, the volume is only (1/2)^3 = 12.5% of the total; in general the fraction is (1/2)^d.
For a given radius, raising the dimensionality above 5 in fact decreases the volume.
(plot of the unit-ball volume as a function of d, from Wikipedia)
The volume concentrates near the surface: most points will appear to lie at a roughly uniform distance from each other.
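A quick check of both facts, using the standard d-ball volume formula V_d(r) = π^(d/2) r^d / Γ(d/2 + 1) (a sketch with the Python standard library):

```python
from math import gamma, pi

def ball_volume(d, r=1.0):
    """Volume of the d-dimensional ball of radius r: pi^(d/2) * r^d / Gamma(d/2 + 1)."""
    return pi ** (d / 2) * r ** d / gamma(d / 2 + 1)

for d in (1, 2, 3, 5, 10, 20):
    full = ball_volume(d)                         # unit-radius volume: peaks near d = 5, then shrinks
    inner_fraction = ball_volume(d, 0.5) / full   # equals (1/2)^d: the 50%-radius core empties out
    print(d, round(full, 4), round(inner_fraction, 6))
```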
Sacrifice some details and focus on the key dimensions of the dataset
i.e., project data points down to 1, 2 or 3 dimensions
Which dimensions should we visualize?
Try visualizing the wines by alcohol strength and by year…
Each data point x can be expressed as a linear combination of orthonormal vectors u_1, …, u_d:
x = a_1 u_1 + a_2 u_2 + … + a_d u_d, where, e.g., a_i = u_i^T x.
DR is seen as a base change: a = U^T x, where U = [u_1 … u_d].
U need not be the traditional X/Y/Z/… axes.
It suffices that the columns of U are orthonormal.
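A tiny numeric sketch of such a base change (the orthonormal basis here is just the X/Y frame rotated by 30°, chosen for illustration):

```python
import numpy as np

theta = np.deg2rad(30)
# Columns of U form an orthonormal basis: the X/Y axes rotated by 30 degrees
U = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

x = np.array([2.0, 1.0])
a = U.T @ x            # coordinates in the new basis: a_i = u_i^T x
x_back = U @ a         # x = a_1 u_1 + a_2 u_2 recovers the original point exactly
print(np.round(a, 3), np.allclose(x, x_back))
```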
PCA seeks a lower-dimensional representation that preserves as much of the data's variance as possible.
The direction along which the largest projected variance occurs is the (first) Principal Component.
The direction (orthogonal to the first) that captures the 2nd-largest projected variance is the Second PC, and so on.
The covariance matrix captures the covariance for each pair of attributes: Σ = (1/n) Σ_i (x_i − μ)(x_i − μ)^T, where μ is the mean point.
It is positive semidefinite and symmetric: v^T Σ v ≥ 0 for every v, and Σ^T = Σ.
In the expansion x = a_1 u_1 + … + a_d u_d, the order of the terms is irrelevant (b/c of orthonormality),
so we can reorder the u_i by importance and truncate the above representation to just the r most important ones: x' = a_1 u_1 + … + a_r u_r.
We can now quantify the error introduced by the truncation.
The error vector is exactly what was dropped: ε = x − x' = a_{r+1} u_{r+1} + … + a_d u_d.
By complex manipulations (see, e.g., Zaki & Meira, p. 190), the expected squared error turns out to be the total variance minus the variance projected onto the retained directions.
To minimize the error, we therefore try to maximize the second term (the projected variance).
Fact: the variance projected onto a direction u is u^T Σ u, and it is maximized when u is the dominant eigenvector of the covariance matrix Σ.
Now the principal components are exactly the eigenvectors of Σ, taken in order of decreasing eigenvalue.
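A minimal end-to-end sketch of PCA along these lines (eigen-decomposition of the covariance matrix with NumPy; the random data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))                 # toy data: 200 points in d = 5
X[:, 1] = 0.9 * X[:, 0] + 0.1 * X[:, 1]       # inject correlation so one direction dominates

mu = X.mean(axis=0)
Sigma = np.cov(X, rowvar=False)               # covariance matrix: symmetric, PSD

eigvals, eigvecs = np.linalg.eigh(Sigma)      # eigh: eigen-decomposition for symmetric matrices
order = np.argsort(eigvals)[::-1]             # sort by decreasing eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

r = 2
U_r = eigvecs[:, :r]                          # top-r principal components
A = (X - mu) @ U_r                            # reduced coordinates a_i = u_i^T (x - mu)
X_hat = A @ U_r.T + mu                        # reconstruction from r components

captured = eigvals[:r].sum() / eigvals.sum()  # fraction of variance captured
error = np.mean(np.sum((X - X_hat) ** 2, axis=1))
print(round(captured, 3), round(error, 3))    # maximizing captured variance minimizes the error
```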
Further readings: PCA with
3. the SherlockML platform
Please see the presentation of Ch. 11 of the MMDS textbook.
[Scatter plot of the Iris dataset: X: Sepal Length, Y: Sepal Width; classes include I. Setosa]