# Dimensionality reduction

## PCA and SVD

live together in peace and harmony :) scnr

• False: Eigenvalue decomposition is performed on the centered variance-covariance matrix.
• True
• True: Rotating only changes the eigenvectors. The error is only influenced by the eigenvalues.
• False: The predicted ratings depend on the initialization of the unknown values, so the performance is affected as well.
• False: Depends on whether the unknown entries are also filled in the same way; if they are, the statement is true.
• False: No assumptions about the meaning of the feature vectors can be made.

## PCA

### Part 1

#### a)

Let $A=USV^{T}$ be the SVD of $A$. We then have to diagonalise $X$:

$X=A^{T}A=(USV^{T})^{T}(USV^{T})=VSU^{T}USV^{T}=VSISV^{T}=VS^{2}V^{T}$
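This identity is easy to check numerically; a minimal numpy sketch (the $4\times 3$ matrix $A$ is a made-up example):

```python
import numpy as np

# Toy check: the right-singular vectors V of A diagonalise X = A^T A,
# with eigenvalues equal to the squared singular values.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
X = A.T @ A

D = Vt @ X @ Vt.T            # should be diag(s^2)
assert np.allclose(D, np.diag(s**2))
```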

#### b)

PCA has to find the eigenvalue decomposition of ${\frac {1}{N}}AA^{T}\in \mathbb {R} ^{d\times d}$. This is usually done via the power method. Since $d$ is assumed to be large, this means many multiplications with a huge matrix, which is slow. With the SVD we can instead compute $A^{T}A\in \mathbb {R} ^{n\times n}$ first and perform the eigenvalue decomposition on this much smaller matrix.

Note: the solution above contradicts the problem statement, which explicitly says $A^{T}A\in \mathbb {R} ^{d\times d}$.

### Part 2

#### a)

${\bar {X}}={\begin{pmatrix}-1&0&1\\1&0&-1\end{pmatrix}},\quad \Sigma ={\frac {1}{3}}{\begin{pmatrix}2&-2\\-2&2\end{pmatrix}},\quad u_{1}={\frac {1}{\sqrt {2}}}{\begin{pmatrix}1\\-1\end{pmatrix}}$

#### b)

We project the points onto the principal direction

$u_{1}={\frac {1}{\sqrt {2}}}{\begin{pmatrix}1\\-1\end{pmatrix}}$

And get the new coordinates

$-{\sqrt {2}},\ 0,\ {\text{and}}\ {\sqrt {2}}$

#### c)

$var(z)=E((z-Ez)^{2})=Ez^{2}-(Ez)^{2}=(2+0+2)/3=4/3$.

This is the variance of the projected coordinates $z$; equivalently $z^{T}z/n=4/3$, which equals the largest eigenvalue of $\Sigma$.
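A short numpy sketch verifying the numbers in a)–c) (the centred data matrix is taken from a) above):

```python
import numpy as np

# Columns of Xbar are the three centred data points.
Xbar = np.array([[-1.0, 0.0, 1.0],
                 [ 1.0, 0.0, -1.0]])
Sigma = Xbar @ Xbar.T / 3

# Principal directions: eigh returns eigenvalues in ascending order.
w, V = np.linalg.eigh(Sigma)
u1 = V[:, -1]                     # eigenvector of the largest eigenvalue

z = u1 @ Xbar                     # projected coordinates z_n = u1^T x_n
# Variance of the projected data equals the largest eigenvalue 4/3.
assert np.isclose(z @ z / 3, 4 / 3)
assert np.isclose(w[-1], 4 / 3)
```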

#### d)

0, there is no error, as the first principal component explains all the data.

## SVD

### a)

$Y^{T}=(X+uu^{T})^{T}=X^{T}+(uu^{T})^{T}=X+uu^{T}=Y$, since $X$ is symmetric.

### b)

$Yu=(X+uu^{T})u=Xu+u(u^{T}u)=\lambda _{1}u+u\cdot 1=(\lambda _{1}+1)u$, using that $u$ has unit norm.

### c)

$Yv=(X+uu^{T})v=Xv+u(u^{T}v)=Xv+u\cdot 0=\lambda _{2}v$, since eigenvectors of a symmetric matrix belonging to distinct eigenvalues are orthogonal.
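A toy numerical check of b) and c), with a hypothetical diagonal $X$ (eigenvalues $3$ and $2$):

```python
import numpy as np

u = np.array([1.0, 0.0])         # unit eigenvector of X, Xu = 3u
v = np.array([0.0, 1.0])         # orthogonal eigenvector, Xv = 2v
X = np.diag([3.0, 2.0])
Y = X + np.outer(u, u)           # the rank-one update Y = X + uu^T

assert np.allclose(Y @ u, 4.0 * u)   # eigenvalue lambda_1 + 1 = 4
assert np.allclose(Y @ v, 2.0 * v)   # eigenvalue lambda_2 = 2 unchanged
```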

# Clustering Mixture Models, NMF

## Mixture Model and Expectation Maximization

### a)

K-Means assigns each data point to exactly one centroid.

A GMM softly assigns a point to several clusters with the $\pi _{k}$ weights.

K-Means assumes sphere-shaped clusters of the same size; GMM clusters can fit any ellipsoid shape (by fitting the covariance matrices).

### b)

$\langle x\rangle =E(p_{\theta }(x))=E(\sum \limits _{k=1}^{K}\pi _{k}{\mathcal {N}}(x|\mu _{k},\Sigma _{k}))=\sum \limits _{k=1}^{K}\pi _{k}E({\mathcal {N}}(x|\mu _{k},\Sigma _{k}))=\sum \limits _{k=1}^{K}\pi _{k}\mu _{k}$

### c)

${\mathcal {L}}(X,\theta )=\sum \limits _{n=1}^{N}\log \sum \limits _{k=1}^{K}\pi _{k}{\mathcal {N}}(x_{n}|\mu _{k},\Sigma _{k})$
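For concreteness, a sketch that evaluates this log-likelihood for a 1-D mixture; the data points and parameters are made-up toy values:

```python
import numpy as np

def gauss(x, mu, sigma):
    # 1-D Gaussian density N(x | mu, sigma^2)
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x = np.array([-1.0, 0.5, 2.0])    # N = 3 points
pi = np.array([0.4, 0.6])         # mixture weights, sum to 1
mu = np.array([0.0, 1.5])
sigma = np.array([1.0, 0.5])

# L = sum_n log sum_k pi_k N(x_n | mu_k, sigma_k^2)
densities = gauss(x[:, None], mu, sigma)   # shape (N, K)
log_lik = np.sum(np.log(densities @ pi))
```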

### d)

${\mathcal {L}}(X,\theta )=\sum \limits _{n=1}^{N}\log \sum \limits _{k=1}^{K}q_{kn}\pi _{k}{\frac {{\mathcal {N}}(x|\mu _{k},\Sigma _{k})}{q_{kn}}}\geq \sum \limits _{n=1}^{N}\sum \limits _{k=1}^{K}q_{kn}\log \left(\pi _{k}{\frac {{\mathcal {N}}(x|\mu _{k},\Sigma _{k})}{q_{kn}}}\right)$
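The bound is tight when $q_{kn}$ equals the responsibility of cluster $k$ for point $x_{n}$; a sketch of this E-step, again with made-up 1-D toy parameters:

```python
import numpy as np

def gauss(x, mu, sigma):
    # 1-D Gaussian density N(x | mu, sigma^2)
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x = np.array([-1.0, 0.5, 2.0])
pi = np.array([0.4, 0.6])
mu = np.array([0.0, 1.5])
sigma = np.array([1.0, 0.5])

# E-step: q_kn proportional to pi_k * N(x_n | mu_k, sigma_k^2),
# normalized so the responsibilities for each point sum to 1.
weighted = pi * gauss(x[:, None], mu, sigma)      # shape (N, K)
q = weighted / weighted.sum(axis=1, keepdims=True)
```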

### e)

Just apply the chain rule and differentiate the lower bound. I have no clue about the derivative of a determinant nor of an inverse, and it's only 3 points :P

See Problem 1 question 7 (Exercise 9 2019) http://www.da.inf.ethz.ch/teaching/2019/CIL/exercises/solution09.pdf

## Word embeddings, Block Chain and Neural Networks

### a)

$x_{i}^{(k+1)}=x_{i}^{k}-\eta \nabla _{x_{i}}{\mathcal {H}}(\theta ,N)=x_{i}^{k}+2\eta f(n_{ij})(\log n_{ij}-\langle x_{i},y_{j}\rangle )y_{j}$

$y_{j}^{(k+1)}=y_{j}^{k}-\eta \nabla _{y_{j}}{\mathcal {H}}(\theta ,N)=y_{j}^{k}+2\eta f(n_{ij})(\log n_{ij}-\langle x_{i},y_{j}\rangle )x_{i}$
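One step of these updates in numpy; the weighting function `f` and the toy vectors are made-up illustrations, and the squared residual $(\log n_{ij}-\langle x_{i},y_{j}\rangle )^{2}$ should shrink after the step:

```python
import numpy as np

def f(n):
    # A common GloVe-style weighting function (assumed choice).
    return min((n / 100.0) ** 0.75, 1.0)

eta, n_ij = 0.05, 20.0
x_i = np.array([0.1, -0.2])
y_j = np.array([0.3, 0.4])

residual = np.log(n_ij) - x_i @ y_j
x_new = x_i + 2 * eta * f(n_ij) * residual * y_j
y_new = y_j + 2 * eta * f(n_ij) * residual * x_i

# The fit to log n_ij improves after the step.
assert abs(np.log(n_ij) - x_new @ y_new) < abs(residual)
```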

### b)

Rectified linear unit: $\operatorname {ReLU} (x)=\max(x,0)$

Sigmoid: $s(x)={\frac {1}{1+e^{-x}}}$
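Both activations as one-liners in numpy:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

assert relu(np.array([-2.0, 3.0])).tolist() == [0.0, 3.0]
assert np.isclose(sigmoid(0.0), 0.5)
```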

### c)

• localized features
• translation invariance
• weight sharing -> fewer parameters

# Sparse Coding and Robust PCA

## Fourier and wavelet transform

### Part 1

#### a)

• Easily invertible by transposing
• Length preservation of signals

#### b)

• Keep only the low-frequency coefficients
• Apply the inverse Fourier transform to reconstruct the signal
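A sketch of this low-pass compression on a toy periodic signal (the signal and the cutoff `K` are arbitrary choices):

```python
import numpy as np

t = np.arange(64)
# Low-frequency sine plus a small high-frequency component.
signal = np.sin(2 * np.pi * t / 64) + 0.01 * np.cos(2 * np.pi * 10 * t / 64)

coeffs = np.fft.fft(signal)
K = 4
compressed = np.zeros_like(coeffs)
compressed[:K] = coeffs[:K]      # low positive frequencies (and DC)
compressed[-K:] = coeffs[-K:]    # matching negative frequencies

reconstruction = np.fft.ifft(compressed).real
# Only the small high-frequency component is lost.
assert np.mean((reconstruction - signal) ** 2) < 1e-3
```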

#### c)

High frequency signals in the scarf and blanket patterns

#### d)

One can capture local extrema without causing a periodic repetition of them, so wavelets are better suited to non-periodic signals.

#### e)

• True
• False: if the rank is full, the determinant is either 1 or -1
• True: Length preservation
• False: Why shouldn't it?

### Part 2

#### a)

$m(B)=0,\ {\frac {1}{\sqrt {2}}}(1,1)^{T}$

#### b)

We also demand that the solution be sparse, i.e. contain as many zero entries as possible.

#### c)

• Overcomplete dictionaries
• Dictionary learning

Alternative solution:

• Matching pursuit
• Convex relaxation (use 1-norm instead of 0-norm)
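A minimal sketch of the greedy matching pursuit idea mentioned above (dictionary atoms are assumed to be unit-norm columns of `U`):

```python
import numpy as np

def matching_pursuit(U, x, n_iter=10):
    """Greedy sparse coding: repeatedly pick the dictionary atom most
    correlated with the residual and subtract its contribution."""
    residual = x.astype(float).copy()
    z = np.zeros(U.shape[1])
    for _ in range(n_iter):
        corr = U.T @ residual
        k = np.argmax(np.abs(corr))       # best-correlated atom
        z[k] += corr[k]
        residual = residual - corr[k] * U[:, k]
    return z

# With an orthonormal dictionary the signal is recovered exactly.
z = matching_pursuit(np.eye(3), np.array([3.0, 0.0, 1.0]), n_iter=2)
```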

#### d)

• Signal compression
• Denoising of signals

## Robust PCA

Because he is not moving.

Note: not covered in 2018

# Optimisation

## Convex Optimisation

• True
• False: (1,0) and (0,1) are in set but (0.5, 0.5) is not
• False: ${\begin{pmatrix}1&0\\0&0\end{pmatrix}}$ and ${\begin{pmatrix}0&0\\0&1\end{pmatrix}}$
• False: (1,0) and (0,1)

## Lagrange Duality

Dual function not covered in 2018

## Optimisation methods

### Part 1

#### a)

(0,0)

#### b)

$\nabla f(x,y)=(2x,2y)^{T}$

#### c)

$z^{(0)}=(2,2)^{T}$

$z^{(1)}=z^{(0)}-\eta \nabla f(z^{(0)})=(2,2)^{T}-{\frac {1}{2}}(4,4)^{T}=(0,0)^{T}$

One step is needed. One step is great.

#### d)

$\eta =1$ oscillates between two points. $\eta ={\frac {1}{4}}$ takes longer to converge (in fact, infinitely many steps to reach the minimum exactly).
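The behaviour in c) and d) can be reproduced in a few lines (assuming $f(x,y)=x^{2}+y^{2}$, consistent with the gradient in b)):

```python
import numpy as np

def run_gd(eta, steps):
    # Gradient descent on f(x, y) = x^2 + y^2 from z0 = (2, 2).
    z = np.array([2.0, 2.0])
    for _ in range(steps):
        z = z - eta * 2 * z      # gradient is (2x, 2y)
    return z

assert np.allclose(run_gd(0.5, 1), [0.0, 0.0])   # eta = 1/2: done in one step
assert np.allclose(run_gd(1.0, 2), [2.0, 2.0])   # eta = 1: oscillates
assert np.linalg.norm(run_gd(0.25, 50)) > 0      # eta = 1/4: never exactly 0
```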

#### e)

Not too large, not too small (very math-y I know)

(Maybe they meant the conditions $\sum _{i}\eta _{i}=\infty ,\sum _{i}\eta _{i}^{2}<\infty$? Wasn't covered this year though)

### Part 2

Coordinate Descent?

AFAIK not covered in 2018