# Lösungsvorschlag Computational Intelligence Lab FS17

Disclaimer: I am quite unsure about these solutions, but I am adding them anyway so that there is something that can be corrected :D

# 1 Dimensionality reduction

## 1.1 PCA on an Ellipse

a)

${\displaystyle {\frac {(a*cos({\frac {k*\pi }{2n}}))^{2}}{a^{2}}}+{\frac {(b*sin({\frac {k*\pi }{2n}}))^{2}}{b^{2}}}=cos({\frac {k*\pi }{2n}})^{2}+sin({\frac {k*\pi }{2n}})^{2}=1}$ (since ${\displaystyle cos(x)^{2}+sin(x)^{2}=1}$)

b)

for x coordinate:

${\displaystyle {\frac {1}{4n}}\sum _{k=1}^{4n}a*cos({\frac {k*\pi }{2n}})={\frac {a}{4n}}\sum _{k=1}^{4n}cos({\frac {k*\pi }{2n}})=0}$

(since the cosine values are equally spaced over a full period between 0 and 2 * pi, so they sum to 0)

for y coordinate:

${\displaystyle {\frac {1}{4n}}\sum _{k=1}^{4n}b*sin({\frac {k*\pi }{2n}})={\frac {b}{4n}}\sum _{k=1}^{4n}sin({\frac {k*\pi }{2n}})=0}$

(since the sine values are equally spaced over a full period between 0 and 2 * pi, so they sum to 0)

c)

since the mean is 0, nothing has to be subtracted. So the covariance matrix is ${\displaystyle {\frac {1}{4n}}\sum _{k=1}^{4n}{\begin{pmatrix}a^{2}*cos({\frac {k*\pi }{2n}})^{2}&ab*cos({\frac {k*\pi }{2n}})*sin({\frac {k*\pi }{2n}})\\ab*cos({\frac {k*\pi }{2n}})*sin({\frac {k*\pi }{2n}})&b^{2}*sin({\frac {k*\pi }{2n}})^{2}\end{pmatrix}}={\begin{pmatrix}{\frac {a^{2}}{4n}}*\sum _{k=1}^{4n}cos({\frac {k*\pi }{2n}})^{2}&0\\0&{\frac {b^{2}}{4n}}*\sum _{k=1}^{4n}sin({\frac {k*\pi }{2n}})^{2}\end{pmatrix}}}$ = (both sums take the same value; call it ${\displaystyle y}$ — the note below shows ${\displaystyle y=2n}$) ${\displaystyle {\begin{pmatrix}{\frac {a^{2}*y}{4n}}&0\\0&{\frac {b^{2}*y}{4n}}\end{pmatrix}}}$

Note: ${\displaystyle \sum _{k=1}^{4n}\cos({\frac {k*\pi }{2n}})^{2}=2\sum _{k=1}^{2n}\cos({\frac {k*\pi }{2n}})^{2}(\because \cos({\frac {k*\pi }{2n}})=\cos({\frac {(4n-k)*\pi }{2n}}))}$

${\displaystyle \because \cos({\frac {k*\pi }{2n}})^{2}+\cos({\frac {(2n-k)*\pi }{2n}})^{2}=\cos({\frac {k*\pi }{2n}})^{2}+\cos(\pi -{\frac {k*\pi }{2n}})^{2}=\cos({\frac {k*\pi }{2n}})^{2}+\sin({\frac {k*\pi }{2n}})^{2}=1}$ ${\displaystyle \therefore \sum _{k=1}^{2n}\cos({\frac {k*\pi }{2n}})^{2}=n\therefore \sum _{k=1}^{4n}\cos({\frac {k*\pi }{2n}})^{2}=2n}$

The final result should be ${\displaystyle {\begin{pmatrix}{\frac {a^{2}}{2}}&0\\0&{\frac {b^{2}}{2}}\end{pmatrix}}}$
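A quick numerical sanity check of a) to c) (my own, not part of the exam; the values of `a`, `b`, `n` are made up):

```python
import numpy as np

# Sample the 4n ellipse points and verify that the mean is 0 and the
# covariance matrix is diag(a^2/2, b^2/2).
n = 100
a, b = 3.0, 1.0
k = np.arange(1, 4 * n + 1)
theta = k * np.pi / (2 * n)
X = np.stack([a * np.cos(theta), b * np.sin(theta)], axis=1)  # shape (4n, 2)

mean = X.mean(axis=0)
cov = X.T @ X / (4 * n)   # mean is zero, so no centering needed
print(mean, cov)
```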

d)

since we know that a > b, the first principal vector is ${\displaystyle (1,0)^{T}}$ with eigenvalue ${\displaystyle {\frac {a^{2}y}{4n}}={\frac {a^{2}}{2}}}$

e)

the projection just keeps the x-coordinate, so ${\displaystyle a*cos({\frac {2k*\pi }{4n}})}$

f)

The per-point reconstruction error is the discarded y coordinate, ${\displaystyle b*sin({\frac {2k*\pi }{4n}})}$

Remark: I think the reconstruction error refers to the total error over all points, which is the sum of the discarded eigenvalues. That is ${\displaystyle {\frac {b^{2}}{2}}}$

## 1.2 PCA in 3D

a)

${\displaystyle {\frac {2}{2n+1}}*\sum _{i=1}^{n}{\begin{pmatrix}4i^{2}&-10i^{2}&6i^{2}\\-10i^{2}&25i^{2}&-15i^{2}\\6i^{2}&-15i^{2}&9i^{2}\end{pmatrix}}={\frac {n(n+1)}{3}}*{\begin{pmatrix}4&-10&6\\-10&25&-15\\6&-15&9\end{pmatrix}}}$

then rewrite this using ${\displaystyle \sum _{i=1}^{n}i^{2}={\frac {n(n+1)(2n+1)}{6}}}$.

b)

The first principal component is ${\displaystyle {\begin{pmatrix}2\\-5\\3\\\end{pmatrix}}}$ or normalized ${\displaystyle {\begin{pmatrix}{\frac {2}{\sqrt {38}}}\\{\frac {-5}{\sqrt {38}}}\\{\frac {3}{\sqrt {38}}}\\\end{pmatrix}}}$

Its corresponding eigenvalue is ${\displaystyle 38*{\frac {n(n+1)}{3}}}$

The easiest way to arrive at these values is to realize a priori that the points lie on a line, so the line's direction must be the eigenvector corresponding to the only non-zero eigenvalue. Then you can just multiply the covariance matrix with any direction vector of the line and determine how much its length is scaled by the linear transformation. If you don't see this, you can calculate the determinant of the matrix minus lambda on the diagonal and set it to zero, but surely nobody would ever do that... Who, me? Noo...

c)

normalize the eigenvector first and then multiply: ${\displaystyle ({\frac {2}{\sqrt {38}}},{\frac {-5}{\sqrt {38}}},{\frac {3}{\sqrt {38}}})*(2i,-5i,3i)^{T}={\sqrt {38}}*i}$

d)

The reconstruction error is 0, since the data can be perfectly reconstructed (rank of covariance matrix is 1).
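A numerical check (my own; assuming, as the covariance above suggests, the data points are ${\displaystyle x_{i}=i*(2,-5,3)^{T}}$ for ${\displaystyle i=-n,\ldots ,n}$):

```python
import numpy as np

# The covariance should have a single non-zero eigenvalue 38 * n(n+1)/3 with
# eigenvector (2, -5, 3)/sqrt(38), and the rank-1 reconstruction is exact.
n = 10
i = np.arange(-n, n + 1)
d = np.array([2.0, -5.0, 3.0])
X = i[:, None] * d                      # shape (2n+1, 3), mean is zero

cov = X.T @ X / (2 * n + 1)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order

top_val = eigvals[-1]
top_vec = eigvecs[:, -1]
proj = X @ top_vec                      # 1-D projection (task c)
X_rec = np.outer(proj, top_vec)         # rank-1 reconstruction (task d)
print(top_val, np.abs(X - X_rec).max())
```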

## 1.3 SVD for Solving Linear Systems

a)

${\displaystyle min_{x}||Ax-b||_{2}=min_{x}||U\Sigma V^{T}x-b||_{2}}$ we can multiply this by ${\displaystyle U^{T}}$ since it is an orthogonal matrix and this won't change the length.

${\displaystyle =min_{x}||U^{T}U\Sigma V^{T}x-U^{T}b||_{2}=min_{x}||\Sigma V^{T}x-U^{T}b||_{2}}$

b)

The problem is convex, so any locally optimal solution is a global optimum. To minimize, take the derivative with respect to ${\displaystyle y}$ and set it to zero: ${\displaystyle 2\Sigma ^{T}(\Sigma y-c)=0\rightarrow \Sigma ^{T}\Sigma y=\Sigma ^{T}c.}$

We can leave out all the dimensions whose singular value in ${\displaystyle \Sigma }$ is zero, since they are multiplied by 0 on both sides anyway. Let ${\displaystyle \Sigma ^{'}}$ be the square (invertible) matrix that consists of the nonzero entries. We now have: ${\displaystyle (\Sigma ^{'})^{T}\Sigma ^{'}y'=(\Sigma ^{'})^{T}c'\rightarrow \Sigma ^{'}y'=c'\rightarrow y'=(\Sigma ^{'})^{-1}c'}$, i.e. ${\displaystyle y=\Sigma ^{+}c}$, where ${\displaystyle \Sigma ^{+}}$ inverts the nonzero singular values and sets the remaining dimensions to 0.

c)

From a) we know that we can instead optimize ${\displaystyle ||\Sigma V^{T}x-U^{T}b||}$. If we set ${\displaystyle y=V^{T}x}$ and ${\displaystyle c=U^{T}b}$, then from b) we get ${\displaystyle y^{*}=\Sigma ^{+}U^{T}b}$ and therefore ${\displaystyle x^{*}=V\Sigma ^{+}U^{T}b}$
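A small numeric sketch of the result (my own; `A`, `b` are random stand-ins, and `A` has full column rank so the pseudo-inverse simply inverts every singular value):

```python
import numpy as np

# Solve min_x ||Ax - b||_2 via x* = V Sigma^+ U^T b and compare against
# numpy's least-squares solver.
rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))
b = rng.normal(size=6)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
x_svd = Vt.T @ ((U.T @ b) / s)          # V Sigma^+ U^T b

x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.abs(x_svd - x_ref).max())
```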

## 1.4 SVD Properties

a) really not sure what they expect here?

${\displaystyle A=U\Sigma V^{T}=\sum _{i=1}^{n}\sigma _{i}u_{i}v_{i}^{T}}$

b)

${\displaystyle AA^{T}=U\Sigma V^{T}(U\Sigma V^{T})^{T}=U\Sigma V^{T}V\Sigma ^{T}U^{T}=U\Sigma \Sigma ^{T}U^{T}}$

so the eigenvalues of ${\displaystyle AA^{T}}$ are the squared singular values ${\displaystyle \sigma _{i}^{2}}$.

${\displaystyle A^{T}A=(U\Sigma V^{T})^{T}U\Sigma V^{T}=V\Sigma ^{T}U^{T}U\Sigma V^{T}=V\Sigma ^{T}\Sigma V^{T}}$

so the eigenvalues of ${\displaystyle A^{T}A}$ are again ${\displaystyle \sigma _{i}^{2}}$, the same as before.
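This is easy to verify numerically (my own check, random `A`):

```python
import numpy as np

# The eigenvalues of A A^T and A^T A both equal the squared singular values.
rng = np.random.default_rng(1)
A = rng.normal(size=(5, 5))
s = np.linalg.svd(A, compute_uv=False)

ev_aat = np.sort(np.linalg.eigvalsh(A @ A.T))
ev_ata = np.sort(np.linalg.eigvalsh(A.T @ A))
print(np.abs(ev_aat - np.sort(s**2)).max())
```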

c)

A being square and symmetric means it has an orthonormal eigenbasis, so the eigendecomposition ${\displaystyle A=U\Lambda U^{T}}$ coincides with an SVD (up to signs). Then, since ${\displaystyle U^{T}U=I}$,

${\displaystyle A^{k}=(U\Lambda U^{T})^{k}=U\Lambda ^{k}U^{T}}$

# 2 Optimization, NMF, pLSA and Word Embedding

## 2.1 Optimization

a)

Proof by contradiction: Assume we have a local minimum ${\displaystyle {\hat {x}}}$ that is not the global minimum, so ${\displaystyle \exists \epsilon >0}$ so that ${\displaystyle \forall x\in B_{\epsilon }({\hat {x}}):f({\hat {x}})\leq f(x)}$, but ${\displaystyle \exists x^{*}}$ so that ${\displaystyle f(x^{*})<f({\hat {x}})}$.

We know that our function is convex, so it must hold that ${\displaystyle f(\alpha {\hat {x}}+(1-\alpha )x^{*})\leq \alpha f({\hat {x}})+(1-\alpha )f(x^{*})}$.

Let's now choose ${\displaystyle \alpha }$ so that ${\displaystyle \alpha {\hat {x}}+(1-\alpha )x^{*}}$ ends up in the ball around ${\displaystyle {\hat {x}}}$, so we know that ${\displaystyle f(\alpha {\hat {x}}+(1-\alpha )x^{*})\geq f({\hat {x}})}$.

We also know, since ${\displaystyle f(x^{*})<f({\hat {x}})}$, that ${\displaystyle \alpha f({\hat {x}})+(1-\alpha )f(x^{*})<f({\hat {x}})}$ for ${\displaystyle \alpha <1}$.

So we get ${\displaystyle f(\alpha {\hat {x}}+(1-\alpha )x^{*})\geq f({\hat {x}})>\alpha f({\hat {x}})+(1-\alpha )f(x^{*})}$ which contradicts the equation stated in the beginning (convexity).

b)

Given in the question we have the smoothness condition:

${\displaystyle -\nabla f(x)^{\top }(x-x^{\star })\leq -{\frac {||\nabla f(x)||_{2}^{2}}{2\beta }}}$

or equivalently:

${\displaystyle -\langle \nabla f(x),x-x^{\star }\rangle \leq -{\frac {1}{2\beta }}||\nabla f(x)||_{2}^{2}}$

And then we have the relation between ${\displaystyle \gamma }$ and ${\displaystyle \beta }$:

${\displaystyle \gamma \leq {\frac {1}{\beta }}}$

${\displaystyle {\frac {\gamma }{2}}\leq {\frac {1}{2\beta }}}$

${\displaystyle -{\frac {1}{2\beta }}\leq -{\frac {\gamma }{2}}}$

${\displaystyle -{\frac {1}{2\beta }}C\leq -{\frac {\gamma }{2}}C}$ for ${\displaystyle C\geq 0}$

Combined we see that:

${\displaystyle -\langle \nabla f(x),x-x^{\star }\rangle \leq -{\frac {1}{2\beta }}||\nabla f(x)||_{2}^{2}\leq -{\frac {\gamma }{2}}||\nabla f(x)||_{2}^{2}}$

And in particular if we choose the point ${\displaystyle x_{t}}$ for the generic ${\displaystyle x}$ we get the following true expression which we will work towards later:

${\displaystyle -\langle \nabla f(x_{t}),x_{t}-x^{\star }\rangle \leq -{\frac {\gamma }{2}}||\nabla f(x_{t})||_{2}^{2}}$

That line is our goal. Starting from the expression the question asks us to prove we can show equivalence to the goal and thereby conclude the proof.

${\displaystyle ||x_{t+1}-x^{\star }||_{2}^{2}\leq ||x_{t}-x^{\star }||_{2}^{2}}$

Expanding squared two norms

${\displaystyle ||x_{t+1}||_{2}^{2}+||x^{\star }||_{2}^{2}-2\langle x_{t+1},x^{\star }\rangle \leq ||x_{t}||_{2}^{2}+||x^{\star }||_{2}^{2}-2\langle x_{t},x^{\star }\rangle }$

${\displaystyle ||x_{t+1}||_{2}^{2}-2\langle x_{t+1},x^{\star }\rangle \leq ||x_{t}||_{2}^{2}-2\langle x_{t},x^{\star }\rangle }$

Replacing ${\displaystyle x_{t+1}}$ with ${\displaystyle x_{t}}$ and the gradient descent update step

${\displaystyle ||x_{t}-\gamma \nabla f(x_{t})||_{2}^{2}-2\langle x_{t}-\gamma \nabla f(x_{t}),x^{\star }\rangle \leq ||x_{t}||_{2}^{2}-2\langle x_{t},x^{\star }\rangle }$

Using linearity of scalar product

${\displaystyle ||x_{t}-\gamma \nabla f(x_{t})||_{2}^{2}-2\langle x_{t},x^{\star }\rangle +2\langle \gamma \nabla f(x_{t}),x^{\star }\rangle \leq ||x_{t}||_{2}^{2}-2\langle x_{t},x^{\star }\rangle }$

${\displaystyle ||x_{t}-\gamma \nabla f(x_{t})||_{2}^{2}+2\langle \gamma \nabla f(x_{t}),x^{\star }\rangle \leq ||x_{t}||_{2}^{2}}$

Expanding norm again

${\displaystyle ||x_{t}||_{2}^{2}+||\gamma \nabla f(x_{t})||_{2}^{2}-2\langle x_{t},\gamma \nabla f(x_{t})\rangle +2\langle \gamma \nabla f(x_{t}),x^{\star }\rangle \leq ||x_{t}||_{2}^{2}}$

${\displaystyle ||\gamma \nabla f(x_{t})||_{2}^{2}-2\langle x_{t},\gamma \nabla f(x_{t})\rangle +2\langle \gamma \nabla f(x_{t}),x^{\star }\rangle \leq 0}$

${\displaystyle ||\gamma \nabla f(x_{t})||_{2}^{2}-2\left(\langle x_{t},\gamma \nabla f(x_{t})\rangle -\langle \gamma \nabla f(x_{t}),x^{\star }\rangle \right)\leq 0}$

${\displaystyle ||\gamma \nabla f(x_{t})||_{2}^{2}-2\left(\langle \gamma \nabla f(x_{t}),x_{t}\rangle -\langle \gamma \nabla f(x_{t}),x^{\star }\rangle \right)\leq 0}$

Using linearity of scalar product

${\displaystyle ||\gamma \nabla f(x_{t})||_{2}^{2}-2\langle \gamma \nabla f(x_{t}),x_{t}-x^{\star }\rangle \leq 0}$

${\displaystyle -2\langle \gamma \nabla f(x_{t}),x_{t}-x^{\star }\rangle \leq -||\gamma \nabla f(x_{t})||_{2}^{2}}$

Moving the ${\displaystyle \gamma }$ out

${\displaystyle -2\gamma \langle \nabla f(x_{t}),x_{t}-x^{\star }\rangle \leq -\gamma ^{2}||\nabla f(x_{t})||_{2}^{2}}$

Dividing by ${\displaystyle 2\gamma }$ which is positive so the direction doesn't flip

${\displaystyle -\langle \nabla f(x_{t}),x_{t}-x^{\star }\rangle \leq -{\frac {\gamma }{2}}||\nabla f(x_{t})||_{2}^{2}}$

And that's our goal from above. QED
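The claim can also be illustrated numerically (my own sketch, on a simple quadratic rather than a general smooth function):

```python
import numpy as np

# f(x) = 0.5 x^T Q x is beta-smooth with beta = lambda_max(Q), its minimizer
# is x* = 0, and with step size gamma = 1/beta the distance to x* never
# increases along the gradient descent iterates.
rng = np.random.default_rng(2)
M = rng.normal(size=(5, 5))
Q = M @ M.T + np.eye(5)               # symmetric positive definite
beta = np.linalg.eigvalsh(Q).max()
gamma = 1.0 / beta

x = rng.normal(size=5)
dists = [np.linalg.norm(x)]
for _ in range(200):
    x = x - gamma * (Q @ x)           # gradient step: grad f(x) = Q x
    dists.append(np.linalg.norm(x))
print(dists[0], dists[-1])
```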

c)

if ${\displaystyle ||x_{t+1}-x^{*}||_{2}^{2}=||x_{t}-\gamma \nabla f(x_{t})-x^{*}||_{2}^{2}=||x_{t}-x^{*}||_{2}^{2}}$ then we know that ${\displaystyle \gamma \nabla f(x_{t})=0}$, so we must have reached the global optimum.

d)

This is similar to c), but since the function is no longer convex, the reached point is just a stationary point, not necessarily the global optimum.

e)

${\displaystyle \nabla _{k}f(x_{t})=\nabla _{k}({\frac {1}{2}}||y-Ax||_{2}^{2}+\lambda ||x||_{1})=(A^{T}Ax)_{k}-(A^{T}y)_{k}+\lambda *sgn(x_{k})}$ (the ${\displaystyle \ell _{1}}$ term is not differentiable at ${\displaystyle x_{k}=0}$; for ${\displaystyle x_{k}\neq 0}$ its derivative is ${\displaystyle \lambda *sgn(x_{k})}$)
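A finite-difference check of this gradient (my own; `A`, `y`, `lam`, `x` are made-up values, with every ${\displaystyle x_{k}\neq 0}$ so the ${\displaystyle \ell _{1}}$ term is differentiable with derivative ${\displaystyle \lambda *sgn(x_{k})}$):

```python
import numpy as np

# Compare the closed-form gradient of 0.5||y - Ax||^2 + lam*||x||_1 against
# central finite differences at a point with no zero entries.
A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, -1.0],
              [2.0, 0.0, 1.0],
              [1.0, 1.0, 1.0]])
y = np.array([1.0, -2.0, 0.5, 3.0])
lam = 0.1
x = np.array([1.0, -2.0, 0.5])        # all entries non-zero

f = lambda x: 0.5 * np.sum((y - A @ x) ** 2) + lam * np.sum(np.abs(x))
grad = A.T @ A @ x - A.T @ y + lam * np.sign(x)

eps = 1e-6
fd = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
               for e in np.eye(3)])
print(np.abs(fd - grad).max())
```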

## 2.2 pLSA and Word Embedding

a)

1) how often does word j occur in document i

2) ${\displaystyle {\begin{pmatrix}&the&king&built&castle&rode&to&likes\\d_{1}&2&1&1&1&0&0&0\\d_{2}&2&1&0&1&1&1&0\\d_{3}&2&1&0&1&0&0&1\\\end{pmatrix}}}$

b)

1) ${\displaystyle p(w|d,z)=p(w|z)}$

2) ${\displaystyle \sum _{i,j\in X}log\sum _{z=1}^{K}v_{zj}u_{zi}}$ with ${\displaystyle \sum _{j}^{N}v_{zj}=\sum _{z}^{K}u_{zi}=1}$

3) lower bound: ${\displaystyle \sum _{i,j\in X}\sum _{z=1}^{K}q_{zij}(log(v_{zj})+log(u_{zi})-log(q_{zij}))}$ update rule: ${\displaystyle q_{zij}={\frac {v_{zj}*u_{zi}}{\sum _{k=1}^{K}v_{kj}*u_{ki}}}}$
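A toy EM loop for these updates (my own sketch, with the indexing from above: `u[z, i]` ≈ p(z|d_i) with columns summing to 1, `v[z, j]` ≈ p(w_j|z) with rows summing to 1; the count matrix `X` is random):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.integers(0, 4, size=(5, 8)).astype(float)   # word counts n(d_i, w_j)
K = 3

u = rng.random((K, 5)); u /= u.sum(axis=0, keepdims=True)
v = rng.random((K, 8)); v /= v.sum(axis=1, keepdims=True)

def loglik(u, v):
    p = (v[:, None, :] * u[:, :, None]).sum(axis=0)  # p(w_j | d_i)
    return float((X * np.log(p + 1e-12)).sum())

lls = [loglik(u, v)]
for _ in range(30):
    # E-step: q[z, i, j] = v[z, j] * u[z, i] / sum_k v[k, j] * u[k, i]
    q = v[:, None, :] * u[:, :, None]
    q /= q.sum(axis=0, keepdims=True)
    # M-step: re-estimate v and u from the expected counts X * q.
    nq = X[None, :, :] * q
    v = nq.sum(axis=1); v /= v.sum(axis=1, keepdims=True)
    u = nq.sum(axis=2); u /= u.sum(axis=0, keepdims=True)
    lls.append(loglik(u, v))
print(lls[0], lls[-1])
```

EM guarantees the log-likelihood never decreases, which the run confirms.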

c)

${\displaystyle {\begin{pmatrix}&the&king&built&castle&rode&to&likes\\the&0&3&1&3&0&1&1\\king&x&0&1&0&1&0&1\\built&x&x&0&0&0&0&0\\castle&x&x&x&0&0&0&0\\rode&x&x&x&x&0&1&0\\to&x&x&x&x&x&0&0\\likes&x&x&x&x&x&x&0\\\end{pmatrix}}}$

d)

1) ${\displaystyle f(n)=min(1,({\frac {n}{n_{max}}})^{\alpha }),\alpha \in [0,1]}$

2) to make sure that words that occur more often together have a larger weight, but also that the weight is capped at some point.

## 2.3 True/False Questions

a) T b) F c) T d) F e) F f) T

(For b) Isn't it true? The squared Frobenius norm ${\displaystyle \|A\|_{F}^{2}}$ is convex. The question says it is convex in ${\displaystyle U^{T}V}$, not in ${\displaystyle U,V}$ specifically.)

(For f) False? We don't need step (b)?

# 3 Neural networks and clustering

## 3.1 Neural Networks

a)

${\displaystyle {\frac {\partial F}{\partial w}}=\sigma (\sum _{i=1}^{p}v_{i}\sigma (\sum _{j=1}^{n}w_{i,j}x_{j}+b_{i}))(1-\sigma (\sum _{i=1}^{p}v_{i}\sigma (\sum _{j=1}^{n}w_{i,j}x_{j}+b_{i})))*(\sum _{i=1}^{p}v_{i}\sigma (\sum _{j=1}^{n}w_{i,j}x_{j}+b_{i})(1-\sigma (\sum _{j=1}^{n}w_{i,j}x_{j}+b_{i}))*\sum _{j=1}^{n}x_{j})}$

I'm not sure about the summation ${\displaystyle \sum _{j=1}^{n}x_{j}}$ in the last factor. Here's my solution:

${\displaystyle {\frac {\partial F}{\partial W_{ij}}}=\sigma '(\sum _{m}v_{m}\sigma (\sum _{n}W_{mn}x_{n}+b_{m}))\,v_{i}\,\sigma '(\sum _{m}W_{im}x_{m}+b_{i})\,x_{j}}$

b)

${\displaystyle \sigma '(x)\leq {\tfrac {1}{4}}}$, thus ${\displaystyle |F(x,\ldots )-F(y,\ldots )|\leq {\tfrac {1}{4}}\sum _{i=1}^{p}|v_{i}|{\tfrac {1}{4}}\sum _{j=1}^{n}|W_{ij}||x_{j}-y_{j}|}$; then by Cauchy-Schwarz ${\displaystyle \ldots \leq {\tfrac {1}{16}}\sum _{i=1}^{p}|v_{i}|\lVert W_{i}\rVert _{2}\lVert x-y\rVert _{2}}$

c)

${\displaystyle \sigma (V^{T}\sigma (Wx+b))=\sigma (V'^{T}\sigma (W'\Sigma ^{-{\frac {1}{2}}}(x-\mu )+b'))=\sigma (V'^{T}\sigma (W'\Sigma ^{-{\frac {1}{2}}}x-W'\Sigma ^{-{\frac {1}{2}}}\mu +b'))}$ so if we choose ${\displaystyle b'=b+W'\Sigma ^{-{\frac {1}{2}}}\mu }$ and ${\displaystyle W'=W*\Sigma ^{\frac {1}{2}}}$, we get the desired result.
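A numeric check of this reparameterization (my own; all weights and the SPD covariance are random stand-ins, and I use the sigmoid for ${\displaystyle \sigma }$):

```python
import numpy as np

# With W' = W Sigma^{1/2} and b' = b + W' Sigma^{-1/2} mu, the network output
# on x equals the primed network's output on the whitened input
# Sigma^{-1/2} (x - mu).
rng = np.random.default_rng(4)
n_in, p = 4, 3
W = rng.normal(size=(p, n_in))
V = rng.normal(size=p)
b_vec = rng.normal(size=p)
mu = rng.normal(size=n_in)
L = rng.normal(size=(n_in, n_in))
Sigma = L @ L.T + np.eye(n_in)                 # SPD covariance

w_eig, Q = np.linalg.eigh(Sigma)               # Sigma = Q diag(w) Q^T
S_half = Q @ np.diag(np.sqrt(w_eig)) @ Q.T
S_inv_half = Q @ np.diag(1 / np.sqrt(w_eig)) @ Q.T

sigmoid = lambda t: 1 / (1 + np.exp(-t))
W2 = W @ S_half
b2 = b_vec + W2 @ S_inv_half @ mu              # = b + W mu

x = rng.normal(size=n_in)
out1 = sigmoid(V @ sigmoid(W @ x + b_vec))
out2 = sigmoid(V @ sigmoid(W2 @ (S_inv_half @ (x - mu)) + b2))
print(out1, out2)
```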

d)

It takes O(n), where n is the number of weights, since backpropagation has to compute a partial derivative with respect to every weight.

e)

1) extract interesting features from image

2) reduce size of image

f)

Each round goes from n×n to (n-2)×(n-2). So we need (rounded up) (n-3)/2 iterations.

## 3.2 True/False Questions

a) T b) F c) T d) F e) F

## 3.3 Clustering

a)

1) compute cluster centers:

${\displaystyle m_{1}=(9+(-2)+5+8)*{\frac {1}{4}}=5}$

${\displaystyle m_{2}=(6+1+(-3)+4)*{\frac {1}{4}}=2}$

2) assign data points:

${\displaystyle c_{1}=\{9,6,5,4,8\},c_{2}=\{-2,1,-3\}}$

3) compute cluster centers:

${\displaystyle m_{1}=(9+6+5+4+8)*{\frac {1}{5}}=6.4}$

${\displaystyle m_{2}=((-2)+(-3)+1)*{\frac {1}{3}}=-{\frac {4}{3}}\approx -1.33}$

4) assign data points:

${\displaystyle c_{1}=\{9,6,5,4,8\},c_{2}=\{-2,1,-3\}}$

From now on, nothing will change anymore.
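The iterations above can be re-run in code (my own sketch of 1-D Lloyd's algorithm, starting from the initial assignment given in the exercise):

```python
import numpy as np

x = np.array([9.0, -2.0, 5.0, 8.0, 6.0, 1.0, -3.0, 4.0])
assign = np.array([0, 0, 0, 0, 1, 1, 1, 1])   # c1 = {9,-2,5,8}, c2 = {6,1,-3,4}

for _ in range(5):
    # Update centers, then reassign each point to its nearest center.
    centers = np.array([x[assign == c].mean() for c in (0, 1)])
    assign = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)

print(centers, sorted(x[assign == 0]), sorted(x[assign == 1]))
```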

b)

Intuitive approach, no idea if that might be true: ${\displaystyle \sum _{k=1}^{K}\pi _{k}*\Sigma _{k}}$

By the law of total variance, the covariance of a mixture is ${\displaystyle \sum _{k=1}^{K}\pi _{k}(\Sigma _{k}+\mu _{k}\mu _{k}^{T})-\mu \mu ^{T}}$ with ${\displaystyle \mu =\sum _{k=1}^{K}\pi _{k}\mu _{k}}$; the intuitive answer is correct when all component means coincide. (The guess ${\displaystyle \sum _{k=1}^{K}\pi _{k}^{2}\Sigma _{k}}$ would treat the mixture as a sum of independent variables, which it is not.)

c)

mean: ${\displaystyle d*k}$, covariance: ${\displaystyle d^{2}*k}$ (or ${\displaystyle {\frac {d(d+1)}{2}}*k}$ if only the free parameters of a symmetric matrix are counted, and ${\displaystyle d*k}$ for diagonal covariances), total: ${\displaystyle d*k+d^{2}*k}$

Remark: for the mixing weights ${\displaystyle \{\pi _{k}\}}$ we additionally need ${\displaystyle K-1}$ parameters.

## 3.4 K-means vs. Gaussian mixture model (GMM)

a)

GMM: ${\displaystyle p(z)=\prod _{k=1}^{K}\pi _{k}^{z_{k}}}$ K-means: clusters are initialized uniformly at random, so ${\displaystyle p(z)=1/K}$

b)

In k-means, clusters are spherical, so the covariances are ${\displaystyle \Sigma _{j}=\sigma ^{2}*I}$. In GMM, clusters can be ellipsoids, so the covariance matrix need not be a multiple of the identity.

c)

Two very stretched ellipses next to each other would be difficult for k-means, since it expects spheres. It should not be a problem for GMM though, since there the shape of the ellipsoid can be learned as well.

d)

Each iteration is cheaper in K-means. In addition, K-means usually needs fewer iterations.

# 4.1 1-D Fourier transform

a)

1b 2a

b)

for signal b, since it has a lot of high frequencies and a has none.

c)

Because then we have to also save the permutation, which will probably cost as much as saving the whole signal.

# 4.2 2-D Fourier transform

a)

Image b, since it has fewer high frequencies than image a (which has a scarf and a tablecloth)

b)

a, since b has a lot of high frequencies in the original image which will be lost. Image a originally does not have a lot of high frequencies (see a), so denoising will work well.

# 4.3 sparse coding with an over-complete dictionary

a)

draw four vectors from middle cluster to the outer clusters.

b)

Each center corresponds to one atom and additionally one center corresponds to no atom at all (0,0).

# 4.4 Compressed Sensing

a)

No, there are no non-zero elements.

b)

Yes it is, only the one counterfeit coin will have a non-zero element.

c)

${\displaystyle y=u^{T}w}$

d)

${\displaystyle Y=Uw}$

e)

${\displaystyle min||U^{-1}y-\delta ||_{0}}$

f)

${\displaystyle min||U^{-1}y-\delta ||_{1}}$

g)

${\displaystyle k\geq c*\log {n}}$ for some constant c