Class 4
February 12th, 2020
Dataset: n points in a d-dimensional space:
essentially, an n × d matrix of floats
For example:
  | X1 | X2 | … | Xd |
---|---|---|---|---|
x1 | x11 | x12 | … | x1d |
x2 | x21 | x22 | … | x2d |
… | … | … | … | … |
xn | xn1 | xn2 | … | xnd |
visualization is hard: we need a projection. Which one?
decision-making is impaired by the need to choose which dimensions to operate on
sensitivity analysis or causal analysis: which dimensions affect the others?
At higher dims., Geometry is not what we experience in d=3.
Highly counter-intuitive properties appear.
Adding dimensions might seem to increase sparsity.
this might sound good for a clean-cut segmentation of the data
In high dimensions, all points tend to be at roughly the same distance from each other.
Experiment: generate a set of random points in the unit hypercube [0,1]^d and compare their pairwise distances as d grows.
bye bye, k-NN
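A minimal sketch of the experiment above (assuming NumPy, uniform points in [0,1]^d, and Euclidean distances):

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_spread(n_points, d):
    """Ratio of the farthest to the nearest pairwise distance for random points in [0,1]^d."""
    X = rng.random((n_points, d))
    diffs = X[:, None, :] - X[None, :, :]            # all pairwise differences
    dists = np.sqrt((diffs ** 2).sum(axis=-1))       # Euclidean distances
    upper = dists[np.triu_indices(n_points, k=1)]    # each unordered pair once
    return upper.max() / upper.min()

for d in (2, 10, 100, 1000):
    print(d, round(distance_spread(100, d), 2))
# The ratio shrinks towards 1 as d grows: nearest and farthest neighbours
# become almost indistinguishable, so k-NN loses its discriminating power.
```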
In high dimensions, all diagonals strangely become (nearly) orthogonal to the axes:
a set of points nicely distributed along a diagonal appears compressed around the origin of the axes.
bye bye, Cosine Similarity
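A quick numeric check (a sketch: the cosine between the main diagonal (1, …, 1) and any coordinate axis is 1/√d, which vanishes as d grows):

```python
import numpy as np

for d in (2, 3, 10, 100, 10000):
    diagonal = np.ones(d)                 # the (1, 1, ..., 1) diagonal direction
    axis = np.zeros(d); axis[0] = 1.0     # a coordinate axis, e_1
    cos = diagonal @ axis / (np.linalg.norm(diagonal) * np.linalg.norm(axis))
    print(d, round(cos, 4))               # cos = 1/sqrt(d): tends to 0, i.e. orthogonal
```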
PCA: a technique from Geometry.
It seeks a lower-dimensional representation of the data.
Idea: rebase the data space so as to minimize the covariances between the new axes.
SVD: a technique from Spectral analysis.
Last week we saw the basics: extend eigen-decomposition to arbitrary matrices.
Singular values uncover categories and their strengths.
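A minimal sketch of this idea with NumPy's np.linalg.svd; the toy matrix and its two "categories" are invented purely for illustration:

```python
import numpy as np

# Toy ratings matrix: the first three rows follow one pattern, the last two another.
X = np.array([[5, 5, 0, 0],
              [4, 5, 0, 1],
              [5, 4, 0, 0],
              [0, 0, 5, 4],
              [1, 0, 4, 5]], dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
print(np.round(s, 2))   # two dominant singular values -> two underlying categories;
                        # their magnitude measures the strength of each category
```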
LDA: trained (supervised) classification, like Decision Trees.
When data is 2-dimensional, e.g., Iris, LDA seeks a vector onto which to project the points.
Strategy: find the direction over which the means of each class are projected as far from each other as possible.
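A minimal sketch of this strategy for two classes (Fisher's criterion computed with NumPy; the toy data are illustrative, not the lecture's example):

```python
import numpy as np

rng = np.random.default_rng(1)
# Two toy 2-D classes with different means (made-up data)
A = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
B = rng.normal(loc=[2.0, 1.0], scale=0.5, size=(50, 2))

mu_A, mu_B = A.mean(axis=0), B.mean(axis=0)
# Within-class scatter matrix (sum of the two class scatters)
S_w = np.cov(A, rowvar=False) * (len(A) - 1) + np.cov(B, rowvar=False) * (len(B) - 1)

# Fisher's direction w ∝ S_w^{-1} (mu_A - mu_B): projecting onto w separates
# the projected class means as much as possible, relative to the within-class spread.
w = np.linalg.solve(S_w, mu_A - mu_B)
w /= np.linalg.norm(w)

separation = abs((mu_A - mu_B) @ w)     # distance between the projected means
print(np.round(w, 3), round(separation, 3))
```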
Adding dimensions makes points seem further apart:
Name | Type | Degrees |
---|---|---|
Chianti | Red | 12.5 |
Grenache | Rosé | 12 |
Bordeaux | Red | 12.5 |
Cannonau | Red | 13.5 |
d(Chianti, Bordeaux) = 0: on Type and Degrees they are indistinguishable.
d(Chianti, Grenache) is small: they differ in Type and by only 0.5 degrees.
Adding dimensions makes points seem further from each other:
Name | Type | Degrees | Grape | Year |
---|---|---|---|---|
Chianti | Red | 12.5 | Sangiovese | 2016 |
Grenache | Rosé | 12 | Grenache | 2011 |
Bordeaux | Red | 12.5 | | 2009 |
Cannonau | Red | 13.5 | Grenache | 2015 |
d(Chianti, Bordeaux) > 7: the Grapes differ and the Years are 7 apart.
d(Chianti, Grenache) > 5: Type and Grape differ, and the Years are 5 apart.
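A sketch of the effect on the wine tables, assuming a naive mixed distance (absolute difference on numbers, 0/1 mismatch on everything else, Bordeaux's missing Grape counted as a mismatch); the encoding is only illustrative:

```python
def wine_distance(a, b):
    """Naive mixed distance: |difference| for numbers, 0/1 mismatch for everything else."""
    total = 0.0
    for va, vb in zip(a, b):
        if isinstance(va, (int, float)) and isinstance(vb, (int, float)):
            total += abs(va - vb)
        else:
            total += 0.0 if va == vb else 1.0
    return total

# (Type, Degrees)            and           (Type, Degrees, Grape, Year)
chianti_2d,  chianti_4d  = ("Red", 12.5), ("Red", 12.5, "Sangiovese", 2016)
bordeaux_2d, bordeaux_4d = ("Red", 12.5), ("Red", 12.5, None, 2009)   # Grape unknown

print(wine_distance(chianti_2d, bordeaux_2d))   # 0.0 -> indistinguishable on two attributes
print(wine_distance(chianti_4d, bordeaux_4d))   # 8.0 -> Grape mismatch + 7 years apart (> 7)
```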
For d=3, the volume of a ball of radius r is (4/3)πr^3.
With 50% of the radius, the volume is only (1/2)^3 = 12.5% of the total; in general the fraction is (1/2)^d.
For a given radius, raising the dimensionality above 5 in fact decreases the volume.
(plot of the unit-ball volume as a function of d, from Wikipedia)
The volume concentrates near the surface: most points will appear to lie at a roughly uniform distance from each other.
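A quick check of both facts, using the standard d-ball volume formula V_d(r) = π^(d/2) r^d / Γ(d/2 + 1) (a sketch with the Python standard library):

```python
from math import gamma, pi

def ball_volume(d, r=1.0):
    """Volume of the d-dimensional ball of radius r: pi^(d/2) * r^d / Gamma(d/2 + 1)."""
    return pi ** (d / 2) * r ** d / gamma(d / 2 + 1)

for d in (1, 2, 3, 5, 10, 20):
    full = ball_volume(d)                         # unit-radius volume: peaks near d = 5, then shrinks
    inner_fraction = ball_volume(d, 0.5) / full   # equals (1/2)^d: the 50%-radius core empties out
    print(d, round(full, 4), round(inner_fraction, 6))
```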
Sacrifice some details and focus on the key dimensions of the dataset
i.e., project data points down to 1, 2 or 3 dimensions
Which dimensions should we visualize?
Try visualizing the wines by alcohol strength and by year…
Each data point x can be expressed as a linear combination of orthonormal vectors u_1, …, u_d:
x = a_1 u_1 + a_2 u_2 + … + a_d u_d, where, e.g., a_i = u_i^T x.
DR is seen as a base change: a = U^T x, where U = [u_1 … u_d].
U need not be the traditional X/Y/Z/… axes.
It suffices that the columns of U are orthonormal.
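A tiny numeric sketch of such a base change (the orthonormal basis here is just the X/Y frame rotated by 30°, chosen for illustration):

```python
import numpy as np

theta = np.deg2rad(30)
# Columns of U form an orthonormal basis: the X/Y axes rotated by 30 degrees
U = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

x = np.array([2.0, 1.0])
a = U.T @ x            # coordinates in the new basis: a_i = u_i^T x
x_back = U @ a         # x = a_1 u_1 + a_2 u_2 recovers the original point exactly
print(np.round(a, 3), np.allclose(x, x_back))
```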
PCA seeks a lower-dimensional representation that preserves as much of the data's variance as possible.
The direction along which the largest projected variance occurs is the (first) Principal Component.
The direction (orthogonal to the first) that captures the 2nd-largest projected variance is the Second PC, and so on.
The covariance matrix captures the covariance for each pair of attributes: Σ = (1/n) Σ_i (x_i − μ)(x_i − μ)^T, where μ is the mean point.
It is positive semidefinite and symmetric: v^T Σ v ≥ 0 for every v, and Σ^T = Σ.
In the expansion x = a_1 u_1 + … + a_d u_d, the order of the terms is irrelevant (b/c of orthonormality),
so we can reorder the u_i by importance and truncate the above representation to just the r most important ones: x' = a_1 u_1 + … + a_r u_r.
We can now quantify the error introduced by the truncation.
The error vector is exactly what was dropped: ε = x − x' = a_{r+1} u_{r+1} + … + a_d u_d.
By complex manipulations (see, e.g., Zaki & Meira, p. 190), the expected squared error turns out to be the total variance minus the variance projected onto the retained directions.
To minimize the error, we therefore try to maximize the second term (the projected variance).
Fact: the variance projected onto a direction u is u^T Σ u, and it is maximized when u is the dominant eigenvector of the covariance matrix Σ.
Now the principal components are exactly the eigenvectors of Σ, taken in order of decreasing eigenvalue.
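A minimal end-to-end sketch of PCA along these lines (eigen-decomposition of the covariance matrix with NumPy; the random data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))                 # toy data: 200 points in d = 5
X[:, 1] = 0.9 * X[:, 0] + 0.1 * X[:, 1]       # inject correlation so one direction dominates

mu = X.mean(axis=0)
Sigma = np.cov(X, rowvar=False)               # covariance matrix: symmetric, PSD

eigvals, eigvecs = np.linalg.eigh(Sigma)      # eigh: eigen-decomposition for symmetric matrices
order = np.argsort(eigvals)[::-1]             # sort by decreasing eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

r = 2
U_r = eigvecs[:, :r]                          # top-r principal components
A = (X - mu) @ U_r                            # reduced coordinates a_i = u_i^T (x - mu)
X_hat = A @ U_r.T + mu                        # reconstruction from r components

captured = eigvals[:r].sum() / eigvals.sum()  # fraction of variance captured
error = np.mean(np.sum((X - X_hat) ** 2, axis=1))
print(round(captured, 3), round(error, 3))    # maximizing captured variance minimizes the error
```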
Further readings: PCA with
3. the SherlockML platform
Please see the presentation of Ch. 11 of the MMDS textbook.
[Scatter plot of the Iris dataset: X: Sepal Length, Y: Sepal Width; classes include I. Setosa]