# Lösungsvorschlag Computational Intelligence Lab FS10

## 1. Clustering

### a)

${\displaystyle U^{\left(0\right)}=\left[{\begin{array}{*{20}c}1&3\\0&3\\\end{array}}\right]}$

${\displaystyle Z^{\left(1\right)}=\left[{\begin{array}{*{20}c}1&0&0&1&0\\0&1&1&0&1\\\end{array}}\right]}$

${\displaystyle U^{\left(1\right)}=\left[{\begin{array}{*{20}c}0.5&{\frac {7}{3}}\\0&{\frac {8}{3}}\\\end{array}}\right]}$

### b)

The idea is to differentiate ${\displaystyle J(\mathbf {U} ,\mathbf {Z} )}$ by ${\displaystyle \mathbf {u} _{k}}$:

${\displaystyle {d \over d\mathbf {u} _{k}}\sum _{n=1}^{N}\mathbf {z} _{k,n}\ ||\mathbf {x} _{n}-\mathbf {u} _{k}||_{2}^{2}}$

${\displaystyle ={d \over d\mathbf {u} _{k}}\sum _{n=1}^{N}\mathbf {z} _{k,n}(\mathbf {x} _{n}-\mathbf {u} _{k})^{T}(\mathbf {x} _{n}-\mathbf {u} _{k})}$

${\displaystyle ={d \over d\mathbf {u} _{k}}\sum _{n=1}^{N}\mathbf {z} _{k,n}(\mathbf {x} _{n}^{T}\mathbf {x} _{n}-\mathbf {x} _{n}^{T}\mathbf {u} _{k}-\mathbf {u} _{k}^{T}\mathbf {x} _{n}+\mathbf {u} _{k}^{T}\mathbf {u} _{k})}$

(we differentiate according to: http://fourier.eng.hmc.edu/e161/lectures/algebra/node7.html )

${\displaystyle =\sum _{n=1}^{N}\mathbf {z} _{k,n}(-\mathbf {x} _{n}-\mathbf {x} _{n}+2\mathbf {u} _{k})\quad ==0}$

${\displaystyle \mathbf {u} _{k}={\sum _{n=1}^{N}z_{k,n}\mathbf {x} _{n} \over \sum _{n=1}^{N}z_{k,n}}}$

### c)

• We first sample the cluster index ${\displaystyle k}$ according to the probabilites ${\displaystyle \mathbf {\pi } _{k}}$
• Then we sample the data point ${\displaystyle x}$ from the distribution ${\displaystyle p_{GMM}(x|\mathbf {\pi } _{k},\mathbf {\mu } _{k},\sigma )}$

### d)

function x=sample_gmm(pis, mus, sigma)
t = rand()
i = 0
s = 0
for k in pis
i = i+1
s = s + k % Try to fit k to a pis(k)
if t <= s
break
end
end
% Select a pis we generated from t
x = rand_gaussian(mus(i), sigma)

Alternatively:

k = find(cumsum(pis) >= rand(1), 1);
x = rand_gaussian(mus(k), sigma);

### e)

#### Proposal #1

${\displaystyle P(x==n)=I_{n<5}\cdot 0.2{1 \over 4}+0.8{1 \over 6}}$

Where ${\displaystyle I}$ is ${\displaystyle 1}$ if ${\displaystyle n<5}$ otherwise ${\displaystyle I=0}$.

#### Proposal #2

${\displaystyle P(x=n)=\left\{{\begin{array}{cc}0.2\cdot {\frac {1}{4}}+0.8\cdot {\frac {1}{6}}&,n~\leq 4\\0.8\cdot {\frac {1}{6}}&,~n>4\end{array}}\right.=\left\{{\begin{array}{cc}{\frac {11}{60}}&,n~\leq 4\\{\frac {4}{30}}&,~n>4\end{array}}\right.}$

#### Proposal #3

The task is to "[...] not compute the values of all these events, but only give a symbolic expression." Accordingly I'd write

${\displaystyle P(S=4)=0.2P(S=6)=0.8P(X=i|S=4)={\frac {1}{4}}\;\forall i\in [1,4]P(X=i|S=6)={\frac {1}{6}}\;\forall i\in [1,6]P(X=n)={\begin{cases}P(X=n,S=4)+P(X=n,S=6)&{\text{if }}0 ${\displaystyle ={\begin{cases}P(X=n|S=4)P(S=4)+P(X=n|S=6)P(S=6)&{\text{if }}0

### f)

blue +: Gaussian Mixture Model

red x: K-Means

Reason: K-means is more sensitive to outliers. Additionally k-means can only consider points that are assigned entirely to its cluster for centroid estimation: a lot of the points that are close are not in the own cluster and therefore the centroids get pushed outside.

## 2. Dimensionality Reduction

### Singular Value Decomposition

1.

${\displaystyle D=\left[{\begin{array}{*{20}c}{45}&0&0&0&0&0&0\\0&{20}&0&0&0&0&0\\0&0&{17}&0&0&0&0\\0&0&0&1&0&0&0\\0&0&0&0&{0.7}&0&0\\0&0&0&0&0&{0.2}&0\\\end{array}}\right]}$

2.

${\displaystyle A=\sum \limits _{k=1}^{Rank\left(A\right)}{d_{k,k}u_{k}(v_{k})^{T}}}$

Note: ${\displaystyle (v_{k})^{T}\neq v_{k}^{T}}$ according to the notation used in the exam (last page)

3.

Choose ${\displaystyle k=3}$ because the first three singular values are very strong while the others are very weak.

${\displaystyle {\tilde {A}}=\sum \limits _{k=1}^{3}{d_{k,k}u_{k}(v_{k})^{T}}}$

4.

In ${\displaystyle U}$: box out the left 3 columns

In ${\displaystyle V^{T}}$: box out the upper 3 rows

In ${\displaystyle D}$: box out the upper left ${\displaystyle 3x3}$ matrix

5.

Approximation error under Euclidean matrix norm for ${\displaystyle k=3}$ is the value of ${\displaystyle d_{k+1}=d_{4}=1}$

6.

${\displaystyle {\tilde {A}}_{i,j}=\sum \limits _{k=1}^{3}{d_{k,k}u_{i,k}v_{j,k}}}$

7.

${\displaystyle A_{5,3}=U_{5,1}\cdot D_{1,1}\cdot V_{1,3}+U_{5,2}\cdot D_{2,2}\cdot V_{2,3}+U_{5,3}\cdot D_{1,3}\cdot V_{3,3}}$

${\displaystyle =0.3\cdot 45\cdot 0.37+\left({-0.6}\right)\cdot 20\cdot \left({-0.38}\right)+\left({-0.1}\right)\cdot 17\cdot 0.47=8.756}$

⇒ user 5 likes item 3

8.

2nd and 5th items are identical. They have the same vector in ${\displaystyle V^{T}}$, indicating that they have the same characteristics (concepts).

The different values in their 7th component is irrelevant because ${\displaystyle Rank\left(A\right)<7}$. These values vanish anyway by multiplication with ${\displaystyle D_{:,7}}$

### Principal Component Analysis

1.

${\displaystyle H}$ is covariance matrix of ${\displaystyle X}$, i.e. ${\displaystyle H={1 \over N}XX^{T}}$ ( resp. ${\displaystyle H=\Sigma }$ )

2.

${\displaystyle H={1 \over N}UDV^{T}\left({UDV^{T}}\right)^{T}={1 \over N}UD\underbrace {V^{T}V} _{=I}D^{T}U^{T}={1 \over N}UDD^{T}U^{T}}$ (Note: previously it was stated that this is ${\displaystyle ={1 \over N}UD^{2}U^{T}}$ which is not true, as nowhere is mentioned that ${\displaystyle A}$ is square)

Giving the eigenvalue decomposition of ${\displaystyle H}$

3.

Eigenvectors are the columns of ${\displaystyle U}$

Eigenvalues are the diagonal elements of ${\displaystyle {1 \over N}DD^{T}}$

## 3. Role Mining

a)

${\displaystyle {1 \over ND}||X-U\otimes Z||_{0}}$

Fraction of wrongly reconstructed permissions. Low=good, High=bad

b.1

High GE: Many wrongly predicted permissions -> bad. We could not predict the permissions well based on a subset of all permissions.

Low GE: Good

b.2

Same formula, but:

Deviation is based on the reconstruction of the original data set.

GE is based on the reconstruction of a new data set, where the reconstruction is based only on minimizing the deviation on a subset of permissions (prediction).

b.3

Quote from slides:

X is typically noisy, i.e., it contains permission assignments without an underlying role!

Therefore noisy means to have permissions without an underlying role and a GE=0 should not be possible.

c)

1.

${\displaystyle p_{\{1,2,3,4\}}}$

2.

Containing ${\displaystyle p_{1}:\,6}$

Not containing ${\displaystyle p_{1}:\,1=\{r_{2}\}}$

3.

${\displaystyle p(x_{5,2}=1|z_{\cdot ,2})=1-(1-0.9)\cdot (1-0.7)\cdot (1-0.5)=0.985}$

## 4. Deterministic Annealing

a) Correct order of code lines: 7 5 3 1 4 6 2

Note: This seems to be the EM Algorithm for Deterministic Annealing. EM Algorithms usually (the ones presented in 2015) perform first the E-step, then the M-step. Hence, I'd expect to see first line 3 followed by line 1. However, in 2015 this algorithm was not part of the lecture. Would be great if someone finds a source confirming one of both orders (.gregor)

Note: The order was wrong, it is first 3 then 1. See http://www.inf.ethz.ch/personal/alexeygr/slt15/lectures/slt15_lecture03.pdf page 15 and 17.

b) Quote by Kenneth Rose (p.6-7): Deterministic annealing can be intuitively understood as follows. Instead of making noisy moves on the given energy surface, we "incorporate" the noise into the energy function. We derive a sequence of effective energy functions that are parametrized by the temperature, the level of the noise. These effective cost functions will be very smooth at low ${\displaystyle T}$, where even very large barriers can be easily traversed, and as ${\displaystyle T}$ is increased they become more ragged and converge to the original energy at ${\displaystyle \to \infty }$. In fact, at ${\displaystyle T=0}$ the cost function will usually be convex, so that the global minimum of this function is easily obtained. Thus the deterministic annealing approach can be viewed as locating the global minimum at T=0, and tracking the minimum while gradually increasing ${\displaystyle T}$, by using conventional convex optimization methods at each temperature.

Note: Whereas the quote is correct, notice that it talks about a ${\displaystyle \beta }$ parameter, not ${\displaystyle T}$. We have ${\displaystyle 1/T=\beta }$. Therefore, the argument about T is the other way around (i.e that the surface is more smooth for ${\displaystyle T\to \infty }$ and more ragged at for ${\displaystyle T\to 0}$

## 5. Sparse Coding

### a)

#### Proposal #1

Two vectors in an inner product space V are orthogonal if their inner product is zero.

An orthogonal matrix is a square matrix with real entries whose column (or rows) are orthogonal unit vectors.

#### Proposal #2

Let ${\displaystyle \mathbf {U} =[\mathbf {u} _{1},\mathbf {u} _{2},\ldots ,\mathbf {u} _{m}]}$, i.e. ${\displaystyle \mathbf {u} _{i}}$ are column vectors of ${\displaystyle \mathbf {U} }$.

${\displaystyle \mathbf {U} }$ is orthogonal if

• its column vectors are orthonormal: ${\displaystyle \langle \mathbf {u} _{i},\mathbf {u} _{j}\rangle ={\begin{cases}0&{\text{if }}i\neq j\\1&{\text{if }}i=j\end{cases}}}$
• it is a square matrix

The exam requires three properties. I'd say:

1. Columns are pairwise orthogonal (case ${\displaystyle i\neq j}$)
2. Columns are unit vectors (case ${\displaystyle i=j}$)
3. Matrix is square

### b)

Reasons why each step holds:

-Definition of euclidean norm

-Property of transpose

-U orthogonal

-Property of I, definition of euclidean norm

We have shown that:

Apply an orthogonal matrix U on a vector X preserves the euclidean length.

### c)

Unnormalized:

${\displaystyle h_{1}={\begin{pmatrix}1\\1\\-1\\-1\end{pmatrix}}}$ ${\displaystyle h_{2}={\begin{pmatrix}1\\-1\\0\\0\end{pmatrix}}}$ ${\displaystyle h_{3}={\begin{pmatrix}0\\0\\1\\-1\end{pmatrix}}}$ ${\displaystyle h_{4}={\begin{pmatrix}1\\1\\1\\1\end{pmatrix}}}$

### d)

Coding: transform x into Haar basis

Thresholding: cut off small components

### e)

The sum of the normalized basis vectors ${\displaystyle h_{1-4}}$. The length of the sum of all normalized basis vectors is ${\displaystyle {\sqrt {D}}=2}$

This results in

${\displaystyle {\begin{pmatrix}1.7\\0.3\\0.7\\-0.7\end{pmatrix}}}$

Reason: The signal has to be represented with all 4 basis vectors. If we would represent this signal with 3 basis vectors, then the reconstruction error would increase.

### f)

[~, cur] = max(abs(U' * res));

and

res = res - coef * U(:, cur);

### g)

As the question states ${\displaystyle k=2}$, we only need to do 2 iterations.

Note that in the picture the residuum is directed in the wrong direction: the residuum must always point towards the real value, as it holds that: approx + residuum = real vector