# Lösungsvorschlag Computational Intelligence Lab FS15: Unterschied zwischen den Versionen

## 1 Dimensionality reduction

### 1.1 SVD and PCA

a) False, tries to maximize the variance
b) False, because the principal component is the eigenvector of the covariance matrix, not of the data matrix. Also, eigenvalues are not defined for rectangular matrices.
c) True
d) False, vectors are rotated as well
e) True

### 1.2 SVD

a) U and V are orthogonal, so for any x: ${\displaystyle \|Ux\|_{2}^{2}=\|V^{T}x\|_{2}^{2}=\|x\|_{2}^{2}}$

Therefore

${\displaystyle \max _{\left\|x\right\|=1}\left\|Ax\right\|=\max _{\left\|x\right\|=1}\left\|USV^{\top }x\right\|=\max _{\left\|y\right\|=1}\left\|USy\right\|=\max _{\left\|x\right\|=1}\left\|USx\right\|=\max _{\left\|x\right\|=1}\left\|Sx\right\|}$

The last step is again due to the orthogonality of U, i.e. it does not change the norm of the vector ${\displaystyle (Sx)}$.

b) For diagonal matrix S it holds: ${\displaystyle rank(S)=rank(S^{\top }S)=rank(S^{\top }S)}$.

${\displaystyle rank(A)=rank(USV^{\top })=rank(S)}$

${\displaystyle rank(AA^{\top })=rank(USV^{\top }VS^{\top }U^{\top })=rank(UDU^{\top })=rank(D)=rank(SS^{\top })=rank(S)}$, where ${\displaystyle D=SS^{\top }}$

${\displaystyle rank(A^{\top }A)=rank(VS^{\top }U^{\top }USV^{\top })=rank(VDV^{\top })=rank(D)=rank(S^{\top }S)=rank(S)}$, where ${\displaystyle D=S^{\top }S}$

Therefore

${\displaystyle rank(A)=rank(A^{\top }A)=rank(AA^{\top })}$

### 1.3 PCA

#### Part 1

${\displaystyle XX^{T}={\begin{pmatrix}6&0\\0&2\\\end{pmatrix}}}$

with principal axes

${\displaystyle {\begin{pmatrix}1\\0\\\end{pmatrix}}}$ and ${\displaystyle {\begin{pmatrix}0\\1\\\end{pmatrix}}}$

#### Part 2

to project, draw a line orthogonal to the axis through the point. The projection is where it crosses the axis.

The axis from top to bottom is better (i.e. / is better than \), as the variance is higher amongst the projected points

${\displaystyle Y={\begin{pmatrix}1&-3&-1&3&2&-6&-2&6\\1&-5&4&-2&2&-10&8&-4\end{pmatrix}}}$

The mean of Y is [0, -0.75] which is different from the mean of X ([0, -0.5]). This is why the suggested solution above, while elegant, does not work.

${\displaystyle {\bar {Y}}={\begin{pmatrix}1&-3&-1&3&2&-6&-2&6\\1.75&-4.25&4.75&-1.25&2.75&-9.25&8.75&-3.25\end{pmatrix}}}$

${\displaystyle {\bar {Y}}{\bar {Y}}^{T}={\begin{pmatrix}100&30\\30&225.5\end{pmatrix}}}$

The principal axes are shifted very slightly: new [-0.9752, 0.2211] vs old [-0.9767, 0.2145] and new [-0.2211, -0.9752] vs old [-0.2145, -0.9767].

I think the main point they want to make here is that by adding these points the direction where the data has its main variance does not really change. That is, the principal axis stay the same, as already pointed out in the provided derivation. In my opinion, calculating the covariance matrix for such a data matrix in a 2 hour exam is cumbersome - it should be enough to give a reasonable explanation why the principal components stay more or less the same.

## 2 Clustering, Mixture Models, NMF

### 2.1 K-means Clustering

Set ${\displaystyle \nabla _{u_{k}}J(U,Z)=0}$ solve for ${\displaystyle u_{k}}$

${\displaystyle u_{k}={\frac {\sum _{n}z_{kn}x_{n}}{1+\sum _{n}z_{kn}}}}$

### 2.2 Mixture Model

${\displaystyle {\mathcal {L}}=\sum _{n}log(\sum _{k}\pi _{k}g(x_{n};\lambda _{k}))}$

Naive update

${\displaystyle {\dfrac {\partial \log L}{\partial \lambda _{i}}}={\dfrac {\partial }{\partial \lambda _{i}}}\sum _{n}^{N}\log \left(\sum _{j}^{k}\pi _{j}{\dfrac {\lambda _{j}^{x_{n}}e^{-\lambda _{j}}}{x_{n}!}}\right)==\sum _{n}^{N}{\dfrac {1}{\sum _{j}^{k}\pi _{j}{\dfrac {\lambda _{j}^{x_{n}}e^{-\lambda _{j}}}{x_{n}!}}}}{\dfrac {\pi _{i}}{x_{n}!}}(x_{n}\lambda _{i}^{x_{n}-1}e^{-\lambda _{i}}-\lambda _{i}^{x_{n}}e^{-\lambda _{i}})=0}$

solve for ${\displaystyle \lambda _{i}}$...

The problem of a naive update equation is that ${\displaystyle \lambda _{i}}$ depends on all other ${\displaystyle \lambda }$s (Because they are all in the same log function, they are not separated when deriving by ${\displaystyle \lambda _{i}}$). This is why we switch to the EM algorithm.

The EM algorithm maximizes the lower bound given by the Jensen's inequality.

### 2.3 Nonnegative Matrix Factorizations

a) False, NMF provides soft clustering
b) False, rank(U) and rank(Z) are strictly smaller than rank(X). Therefore rank(UZ) is also smaller than rank(X).
c) False
d) True
e) True

## 3 Sparse Coding and Dictionary Learning

### 3.1 Sparse Coding

#### Part 1

a) False, not convex
b) False, change of norm is relaxation of the constraints and does not yield the same solution
c) False, it hopefully yields sparser solution, but not necessarily
d) False, for a orthonormal basis m(B)=0

#### Part 2

${\displaystyle (UU^{T})^{T}=U^{T}U=I^{T}=I}$

${\displaystyle U^{T}x=U^{T}Uz=Iz=z}$

Alternative proposal

${\displaystyle U^{-1}UU^{T}=U^{-1}I}$

${\displaystyle U^{T}=U^{-1}}$

${\displaystyle z=U^{-1}x}$

${\displaystyle z=U^{T}x}$

${\displaystyle ||Ux-Ux'||_{2}={\sqrt {(}}x^{T}U^{T}Ux-2x'^{T}U^{T}Ux+x'^{T}U^{T}Ux')={\sqrt {(}}x^{T}x-2x'x+x'^{T}x')=||x-x'||_{2}}$

Alternative proposal

${\displaystyle ||x-x'||_{2}=||Uz-Uz'||_{2}=||U(z-z')||_{2}=||z-z'||_{2}}$

#### Part 3

${\displaystyle z=U^{T}x}$

${\displaystyle {\hat {x}}=U{\hat {z}}}$

Take the parts with the two high peaks on the top signal and on bottom the part with the one high peak

Fourier is better for top signal

Wavelet better for bottom signal

The two peaks correspond to the two frequencies of the signal with the highest amplitudes

## 4 Optimization and Robust PCA

### 4.1 Lagrange Duality

a) ${\displaystyle L(x,\lambda ,\nu )=||x||^{2}+\lambda ^{T}(Ax-b)}$

b) ${\displaystyle d(\lambda ,\nu )=(inf_{x}||x||^{2}+\lambda ^{T}Ax)-\lambda ^{T}b}$

c) ${\displaystyle {\frac {\partial }{\partial x}}(||x||^{2}+\lambda ^{T}Ax)=2x+A^{T}\lambda }$

${\displaystyle {\tilde {x}}={\frac {-A^{T}\lambda }{2}}}$

${\displaystyle d(\lambda ,\nu )=L({\tilde {x}},\lambda ,\nu )=||-{\frac {A^{T}\lambda }{2}}||^{2}+\lambda ^{T}A({\frac {-A^{T}\lambda }{2}})-\lambda ^{T}b}$

${\displaystyle ={\frac {1}{4}}\lambda ^{T}AA^{T}\lambda -{\frac {1}{2}}\lambda ^{T}AA^{T}\lambda -\lambda ^{T}b=-{\frac {1}{4}}\lambda ^{T}AA^{T}\lambda -\lambda ^{T}b}$

### 4.2 Convex Optimization

a) False
b) False (it's in ${\displaystyle \mathbb {R} ^{D+1}}$: https://en.wikipedia.org/wiki/Epigraph_(mathematics) )
c) False, proof by contradiction ${\displaystyle f\left({\dfrac {1}{2}}{\begin{bmatrix}1\\-1\end{bmatrix}}+{\dfrac {1}{2}}{\begin{bmatrix}-1\\1\end{bmatrix}}\right)=f\left({\begin{bmatrix}0\\0\end{bmatrix}}\right)=0\geq {\dfrac {1}{2}}f\left({\begin{bmatrix}1\\-1\end{bmatrix}}\right)+{\dfrac {1}{2}}f\left({\begin{bmatrix}-1\\1\end{bmatrix}}\right)=-2}$
Contradicts the definition of convex functions.
Note that ${\displaystyle f\left({\begin{bmatrix}1\\-1\end{bmatrix}}\right)=f\left({\begin{bmatrix}-1\\1\end{bmatrix}}\right)=g\left({\begin{bmatrix}1&-1\\-1&1\end{bmatrix}}\right)=-1+(-1)=-2}$

### 4.3 Gradient Descent for Ridge Regression

a)

${\displaystyle x^{(k+1)}=x^{(k)}-\gamma \nabla f(x^{(k)})}$

${\displaystyle \nabla f(x)=A^{T}Ax-A^{T}b+\lambda x}$

b)

${\displaystyle x^{(k+1)}=x^{(k)}-\gamma (a_{i}^{T}a_{i}x_{i}-a_{i}^{T}b+\lambda _{i}x_{i})}$

Alternative proposal

The above proposal seems to subtract a scalar from a vector. I think the correct step is

${\displaystyle x^{(k+1)}=(1-\gamma \lambda )x^{(k)}-\gamma (a_{i}^{T}x_{i}-b_{i})a_{i}}$

(only consider one row of ${\displaystyle A}$ and the corresponding entry in ${\displaystyle b}$ at a time)

a)

${\displaystyle f(x)=f_{1}(x)+f_{2}(x)f_{1}(x_{1})={\frac {1}{2}}||Ax_{1}-b||^{2}f_{2}(x_{2})={\frac {\lambda }{2}}||x||^{2}s.t.x_{1}-x_{2}=0}$

b)

${\displaystyle L_{p}={\frac {1}{2}}||Ax_{1}-b||^{2}+{\frac {\lambda }{2}}||x_{2}||^{2}+v^{T}(x_{1}-x_{2})+{\frac {p}{2}}||x_{1}-x_{2}||^{2}}$

c)

${\displaystyle x_{1}=argmin_{x_{1}}L_{p}(x_{1},x_{2}^{(t)},v^{(t)})x_{2}=argmin_{x_{2}}L_{p}(x_{1}^{(t+1)},x_{2},v^{(t)})}$

### 4.5 RPCA for Collaborative Filtering

${\displaystyle min_{L,S}||L||_{*}+\mu ||S||_{1}s.t.L_{ij}+S_{ij}=X_{ij},\forall (i,j)\in \Omega _{obs}}$