Dimensionality reduction

PCA and SVD

live together in peace and harmony :) scnr

• False: Eigenvalue decomposition is performed on the centered variance-covariance matrix.
• True
• True: Rotating only changes the eigenvectors. The error is only influenced by the eigenvalues.
• False: The rating does depend on the initialization of the unknown values, therefore the performance is also affected.
• False: Depends on whether the unknown entries are also filled the same way. If they are the same, then it's True.
• False: No assumptions about the meaning of the feature vectors can be made.

PCA

Part 1

a)

Let ${\displaystyle A=USV^{T}}$ be the SVD of ${\displaystyle A}$. We then have to diagonalise ${\displaystyle X}$:

${\displaystyle X=A^{T}A=(USV^{T})^{T}(USV^{T})=VS^{T}U^{T}USV^{T}=VS^{T}SV^{T}=VS^{2}V^{T}}$

using ${\displaystyle U^{T}U=I}$ and ${\displaystyle S^{T}=S}$ (diagonal).
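A quick numerical sanity check of this identity (a sketch with an assumed random matrix, not part of the exam):

```python
# Check A^T A = V S^2 V^T numerically with NumPy on a random example.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
X = A.T @ A

# Reconstruct X from the right singular vectors and squared singular values
X_reconstructed = Vt.T @ np.diag(s**2) @ Vt
assert np.allclose(X, X_reconstructed)

# Equivalently: the eigenvalues of X are the squared singular values of A
eigvals = np.sort(np.linalg.eigvalsh(X))[::-1]
assert np.allclose(eigvals, s**2)
```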

b)

PCA requires the eigenvalue decomposition of ${\displaystyle {\frac {1}{N}}AA^{T}\in \mathbb {R} ^{d\times d}}$, which is usually computed via the power method. Since ${\displaystyle d}$ is assumed to be large, this means many multiplications with a huge matrix, which is slow. With the SVD we can instead first compute ${\displaystyle A^{T}A\in \mathbb {R} ^{n\times n}}$ and then do the eigenvalue decomposition on this much smaller matrix.

The solution above is incorrect, since the problem explicitly states that ${\displaystyle A^{T}A\in \mathbb {R} ^{d\times d}}$.

Part 2

a)

${\displaystyle {\bar {X}}={\begin{pmatrix}-1&0&1\\1&0&-1\end{pmatrix}},\quad \Sigma ={\frac {1}{3}}{\begin{pmatrix}2&-2\\-2&2\end{pmatrix}},\quad u_{1}={\frac {1}{\sqrt {2}}}{\begin{pmatrix}1\\-1\end{pmatrix}}}$

b)

We project the points onto the principal direction

${\displaystyle u_{1}={\frac {1}{\sqrt {2}}}{\begin{pmatrix}1\\-1\end{pmatrix}}}$

And get the new coordinates

${\displaystyle -{\sqrt {2}},\ 0,\ {\text{and}}\ {\sqrt {2}}}$
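The toy computation above can be verified with NumPy (a sketch; the assumed layout is that the three centered points are the columns of ${\displaystyle {\bar {X}}}$):

```python
# Verify the covariance matrix, first principal direction, and projections
import numpy as np

Xbar = np.array([[-1.0, 0.0,  1.0],
                 [ 1.0, 0.0, -1.0]])
n = Xbar.shape[1]

Sigma = Xbar @ Xbar.T / n            # (1/3) [[2,-2],[-2,2]]
vals, vecs = np.linalg.eigh(Sigma)   # ascending eigenvalues
u1 = vecs[:, -1]                     # eigenvector of the largest eigenvalue
                                     # (sign is ambiguous, as always for PCA)

z = u1 @ Xbar                        # projected coordinates: +/- sqrt(2), 0
print(np.round(np.abs(z), 4))
print(z @ z / n)                     # variance of the projection
```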

c)

${\displaystyle var(X)=E((X-EX)^{2})=EX^{2}-(EX)^{2}=(2+0+2)/3=4/3}$.

Note: the task asks for the variance of the projected data. With the projected coordinates ${\displaystyle z=(-{\sqrt {2}},0,{\sqrt {2}})^{T}}$ this is ${\displaystyle z^{T}z/n=(2+0+2)/3=4/3}$, the largest eigenvalue of ${\displaystyle \Sigma }$, so it agrees with the value above (a previously stated value of ${\displaystyle 1/4}$ is not consistent with these coordinates).

d)

0, there is no error, as the first principal component explains all the data.

SVD

a)

${\displaystyle Y^{T}=(X+uu^{T})^{T}=X^{T}+(uu^{T})^{T}=X+uu^{T}=Y}$, using that ${\displaystyle X}$ is symmetric.

b)

${\displaystyle Yu=(X+uu^{T})u=Xu+uu^{T}u=\lambda _{1}u+u\cdot 1=(\lambda _{1}+1)u}$

c)

${\displaystyle Yv=(X+uu^{T})v=Xv+uu^{T}v=Xv+u\cdot 0\ }$(eigenvectors of a symmetric matrix for distinct eigenvalues are orthogonal, so ${\displaystyle u^{T}v=0}$)${\displaystyle =\lambda _{2}v}$
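All three parts can be checked numerically (a sketch with an assumed ${\displaystyle 2\times 2}$ symmetric example):

```python
# For symmetric X with unit eigenvector u, Y = X + u u^T is symmetric,
# shifts u's eigenvalue by 1, and leaves the other eigenpair unchanged.
import numpy as np

X = np.array([[2.0, 1.0],
              [1.0, 2.0]])          # eigenvalues 1 and 3
vals, vecs = np.linalg.eigh(X)
u = vecs[:, -1]                     # unit eigenvector, eigenvalue lambda_1
v = vecs[:, 0]                      # orthogonal eigenvector, eigenvalue lambda_2

Y = X + np.outer(u, u)
assert np.allclose(Y, Y.T)                     # (a) Y is symmetric
assert np.allclose(Y @ u, (vals[-1] + 1) * u)  # (b) eigenvalue lambda_1 + 1
assert np.allclose(Y @ v, vals[0] * v)         # (c) lambda_2 unchanged
```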

Clustering Mixture Models, NMF

Mixture Model and Expectation Maximization

a)

K-Means assigns each data point to exactly one centroid (hard assignment).

A GMM assigns a point softly to several clusters via the posterior responsibilities.

K-Means assumes spherical clusters of equal size; GMM clusters can fit any ellipsoid shape (by fitting the covariance matrix).

b)

${\displaystyle \langle x\rangle =E(p_{\theta }(x))=E(\sum \limits _{k=1}^{K}\pi _{k}{\mathcal {N}}(x|\mu _{k},\Sigma _{k}))=\sum \limits _{k=1}^{K}\pi _{k}E({\mathcal {N}}(x|\mu _{k},\Sigma _{k}))=\sum \limits _{k=1}^{K}\pi _{k}\mu _{k}}$
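A Monte Carlo check of this mixture mean (a sketch with assumed 1-D toy parameters):

```python
# Sample from a two-component 1-D GMM and compare the empirical mean
# against sum_k pi_k mu_k, as derived above.
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.3, 0.7])
mu = np.array([-2.0, 4.0])
sigma = np.array([1.0, 0.5])

k = rng.choice(2, size=200_000, p=pi)       # pick a component per sample
samples = rng.normal(mu[k], sigma[k])       # then draw from that Gaussian
print(samples.mean())                       # close to 0.3*(-2) + 0.7*4 = 2.2
assert abs(samples.mean() - pi @ mu) < 0.05
```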

c)

${\displaystyle {\mathcal {L}}(X,\theta )=\sum \limits _{n=1}^{N}\log \sum \limits _{k=1}^{K}\pi _{k}{\mathcal {N}}(x_{n}|\mu _{k},\Sigma _{k})}$

d)

${\displaystyle {\mathcal {L}}(X,\theta )=\sum \limits _{n=1}^{N}\log \sum \limits _{k=1}^{K}q_{kn}\pi _{k}{\frac {{\mathcal {N}}(x_{n}|\mu _{k},\Sigma _{k})}{q_{kn}}}\geq \sum \limits _{n=1}^{N}\sum \limits _{k=1}^{K}q_{kn}\log \left(\pi _{k}{\frac {{\mathcal {N}}(x_{n}|\mu _{k},\Sigma _{k})}{q_{kn}}}\right)}$
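The bound can be illustrated numerically (a sketch with an assumed 1-D toy mixture): any distribution ${\displaystyle q_{kn}}$ over ${\displaystyle k}$ gives a lower bound, and the posterior responsibilities make it tight.

```python
# Jensen lower bound on the GMM log-likelihood for a toy 1-D example
import numpy as np

def normal_pdf(x, mu, sigma2):
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

x = np.array([-1.0, 0.5, 2.0])             # N = 3 data points
pi = np.array([0.4, 0.6])                  # K = 2 mixture weights
mu = np.array([-1.0, 1.5])
sigma2 = np.array([1.0, 0.5])

# p[n, k] = pi_k * N(x_n | mu_k, sigma_k^2)
p = pi * normal_pdf(x[:, None], mu, sigma2)
loglik = np.log(p.sum(axis=1)).sum()

# An arbitrary q (uniform over k) gives a lower bound
q = np.full_like(p, 0.5)
bound_uniform = (q * np.log(p / q)).sum()
assert bound_uniform <= loglik

# The posterior responsibilities make the bound tight (E-step choice)
r = p / p.sum(axis=1, keepdims=True)
bound_tight = (r * np.log(p / r)).sum()
assert np.isclose(bound_tight, loglik)
```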

e)

Well, just apply the chain rule and differentiate the lower bound. I have no clue about the derivative of a determinant nor of an inverse, and it's only 3 points :P

See Problem 1 question 7 (Exercise 9 2019) http://www.da.inf.ethz.ch/teaching/2019/CIL/exercises/solution09.pdf

Topic Model and Nonnegative Matrix Factorisation

• True: Summary sheet formula matches
• True: Summary sheet formula matches
• Semi-what?
• False: See Non-Negative Matrix Factorization Revisited: Uniqueness and Algorithm for Symmetric Decomposition. The introduction says the smallest ${\displaystyle K}$ is the non-negative rank of ${\displaystyle \mathbf {X} }$, which can be larger than the rank.
• False: The ${\displaystyle z_{kn}}$ on the right side looks strange

Word embeddings, Block Chain and Neural Networks

a)

${\displaystyle x_{i}^{(k+1)}=x_{i}^{k}-\eta \nabla _{x_{i}}{\mathcal {H}}(\theta ,N)=x_{i}^{k}+2\eta f(n_{ij})(\log n_{ij}-\langle x_{i},y_{j}\rangle )y_{j}}$

${\displaystyle y_{j}^{(k+1)}=y_{j}^{k}-\eta \nabla _{y_{j}}{\mathcal {H}}(\theta ,N)=y_{j}^{k}+2\eta f(n_{ij})(\log n_{ij}-\langle x_{i},y_{j}\rangle )x_{i}}$
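A sketch of one such update for a single co-occurrence entry ${\displaystyle (i,j)}$; the values of ${\displaystyle f(n_{ij})}$, ${\displaystyle n_{ij}}$, the step size, and the embedding dimension are assumed toy choices:

```python
# One simultaneous gradient step on x_i and y_j for the weighted
# least-squares objective f(n_ij) * (log n_ij - <x_i, y_j>)^2.
import numpy as np

rng = np.random.default_rng(1)
x_i = rng.standard_normal(4)
y_j = rng.standard_normal(4)
n_ij, f_nij, eta = 10.0, 0.8, 0.01

def loss(x, y):
    return f_nij * (np.log(n_ij) - x @ y) ** 2

before = loss(x_i, y_j)
residual = np.log(n_ij) - x_i @ y_j
x_new = x_i + 2 * eta * f_nij * residual * y_j
y_new = y_j + 2 * eta * f_nij * residual * x_i
assert loss(x_new, y_new) < before   # the step reduces the entry's loss
```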

b)

Rectified Linear Unit: ${\displaystyle {\text{ReLU}}(x)=\max(x,0)}$

Sigmoid: ${\displaystyle s(x)={\frac {1}{1+e^{-x}}}}$
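Both activations in a couple of lines of NumPy:

```python
# The two activation functions defined above
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))        # negative inputs clipped to 0
print(sigmoid(0.0))   # 0.5
```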

c)

• localized features
• translation invariance
• weight sharing -> fewer parameters

Sparse Coding and Robust PCA

Fourier and wavelet transform

Part 1

a)

• Easily invertible by transposing
• Length preservation of signals

b)

• Keep only the low-frequency coefficients
• Apply the inverse Fourier transform to reconstruct the compressed signal
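A sketch of this compression scheme (the signal and the cutoff are assumed toy choices):

```python
# Compress by keeping only the lowest-frequency Fourier coefficients,
# then invert the transform.
import numpy as np

t = np.linspace(0, 1, 64, endpoint=False)
signal = np.sin(2 * np.pi * 2 * t) + 0.1 * np.sin(2 * np.pi * 20 * t)

coeffs = np.fft.rfft(signal)
kept = coeffs.copy()
kept[5:] = 0                        # keep only the 5 lowest frequencies

reconstructed = np.fft.irfft(kept, n=len(signal))
# The 2-cycle component survives; the high-frequency ripple is removed
error = np.abs(reconstructed - np.sin(2 * np.pi * 2 * t)).max()
assert error < 1e-6
```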

c)

High frequency signals in the scarf and blanket patterns

d)

One can capture local extrema without causing a periodic repetition of them, so wavelets are better suited for non-periodic signals.

e)

• True
• False: if the rank is full, then the determinant is either 1 or -1
• True: Length preservation
• False: Why shouldn't it?

Part 2

a)

${\displaystyle m(B)=0,\ {\frac {1}{\sqrt {2}}}(1,1)^{T}}$

b)

We also demand that the solution be sparse, i.e. contain as many zero entries as possible.

c)

• Overcomplete dictionaries
• Dictionary learning

Alternative solution:

• Matching pursuit
• Convex relaxation (use 1-norm instead of 0-norm)
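Matching pursuit, one of the greedy methods listed above, can be sketched in a few lines (the orthonormal dictionary and the 2-sparse target here are assumed toy choices):

```python
# Greedy matching pursuit: repeatedly pick the dictionary atom most
# correlated with the current residual and subtract its contribution.
import numpy as np

def matching_pursuit(y, D, n_iter):
    z = np.zeros(D.shape[1])
    residual = y.copy()
    for _ in range(n_iter):
        correlations = D.T @ residual
        k = np.argmax(np.abs(correlations))   # best-matching atom
        z[k] += correlations[k]
        residual -= correlations[k] * D[:, k]
    return z

D = np.eye(4)                        # trivial orthonormal dictionary
y = np.array([0.0, 3.0, 0.0, -1.0])  # 2-sparse signal
z = matching_pursuit(y, D, n_iter=2)
assert np.allclose(D @ z, y)         # recovered exactly in 2 iterations
```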

d)

• Signal compression
• Denoising of signals

Robust PCA

Because he is not moving.

Note: not covered in 2018

Optimisation

Convex Optimisation

• True
• False: (1,0) and (0,1) are in set but (0.5, 0.5) is not
• False: ${\displaystyle {\begin{pmatrix}1&0\\0&0\end{pmatrix}}}$ and ${\displaystyle {\begin{pmatrix}0&0\\0&1\end{pmatrix}}}$
• False: (1,0) and (0,1)

Lagrange Duality

Dual function not covered in 2018

Optimisation methods

Part 1

a)

(0,0)

b)

${\displaystyle \nabla f(x,y)=(2x,2y)^{T}}$

c)

${\displaystyle z^{(0)}=(2,2)^{T}}$

${\displaystyle z^{(1)}=z^{(0)}-\eta \nabla f(x,y)=(2,2)^{T}-{\frac {1}{2}}(4,4)^{T}=(0,0)^{T}}$

Only 1 step is needed.

d)

${\displaystyle \eta =1}$ oscillates between two points. ${\displaystyle \eta ={\frac {1}{4}}}$ takes longer until convergence (actually infinitely many steps)
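The behaviour for the three step sizes can be reproduced directly (a sketch, assuming ${\displaystyle f(z)=\|z\|^{2}}$ as in the gradient above):

```python
# Gradient descent on f(z) = ||z||^2 from z0 = (2, 2): the update is
# z <- z - eta * 2z, i.e. z <- (1 - 2*eta) * z.
import numpy as np

def gd(z0, eta, steps):
    z = np.array(z0, dtype=float)
    for _ in range(steps):
        z = z - eta * 2 * z       # gradient of ||z||^2 is 2z
    return z

print(gd((2, 2), eta=0.5, steps=1))   # reaches (0, 0) in a single step
print(gd((2, 2), eta=1.0, steps=3))   # oscillates between (2,2) and (-2,-2)
print(gd((2, 2), eta=0.25, steps=3))  # halves each step, never quite arrives
```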

e)

Not too large, not too small (very math-y I know)

(Maybe they meant the conditions ${\displaystyle \sum _{i}\eta _{i}=\infty ,\sum _{i}\eta _{i}^{2}<\infty }$? Wasn't covered this year though)

Part 2

Coordinate Descent?

AFAIK not covered in 2018