# Lösungsvorschlag Computational Intelligence Lab FS12 (Repetition)

## Question 1: Dimensionality Reduction

### a)

  - No (we have zero error only if we use ALL eigenvalues or if the unused eigenvalues are zero)
- No ($A=A^{T}$ for symmetric matrices, so $U=V$ in SVD --> $A=UDU^{T}$. In PCA, $eig({\frac {1}{N}}\cdot (A-M)\cdot (A-M)^{T})=U\cdot L\cdot U^{T}$, with $L=D^{2}$)
- No (we do not have zero error in general)
- No (PCA also needs to select the k largest eigenvalues of the cov. matrix)
- Yes (The PCA objective can be formulated as minimizing the approx. error, L_2 is correct I think)


### b)

  1. Calculate SVD by hand:
1. Compute $A^{T}A$
2. Compute its eigenvalues (or just check that the square of the elements of $D$, which should be the singular values of $A^{T}A$ are indeed its eigenvalues), take roots of the absolute values and sort in descending order.
3. Construct $D$ and place them along its diagonal
4. Compute the eigenvectors of $A^{T}A$ using the eigenvalues. Place them along the columns of $V$ (in the correct order).

  2. $D$ is not invertible, therefore calculate $AV=UD$. Left side is computable, right side place variables for unknown $u_{ij}$.
For the remaining undefined unknowns, calculate them using the fact $UU'=I$.
We get $U={\begin{pmatrix}4/5&0&3/5\\0&1&0\\-3/5&0&4/5\\\end{pmatrix}}$
The last column needs to be orthogonal to the first two (e.g. cross product).


### c)

  1. each vector has 150 entries, there are 10*10 = 100 patches, so X is 150 x 100, and the covariance matrix is 150 x 150 (X*X^T dimensions)

  2. Compute X' = X - M, where M contains the mean in each column.
Compute the Covariance Matrix Sigma = (1/N)*X'*X'^T
Perform Eigenvalue Decomposition on Sigma, getting U*D*U^T.
For k <= 20 (*), let U_k be the matrix of the first k eigenvectors only
Compress X by calculating U_k^T*X' = Z'
Reconstruct X (with error) by calculating X'~ = U_k*Z', X~ = X'~ + M

     (*) So that we do not exceed 5000 bytes. Dimensions are $Z^{\prime }\in \mathbb {R} ^{k\times N}$,
$U_{k}^{T}\in \mathbb {R} ^{k\times D}$ $X^{\prime }\in \mathbb {R} ^{D\times N}$ with D = 150, N = 100.
Without saving the dictionary U_k and the mean, k <= 5000 / N = 5000 / 100 = 50.
With the dictionary and the mean k*N + K*D + D <= 5000 <=> k <= 19


### d)

$\mathbf {A} =\mathbf {U} \mathbf {D} \mathbf {U} ^{T}$ (since A is symmetric it can be factorized using Eigendecomposition, U is orthogonal), to show: If all values of D (eigenvalues) nonnegative, then A PSD.

Consider any $\mathbf {w}$. We can rewrite $\mathbf {w}$ as $\mathbf {Uz}$, i.e. perform a basis transformation to the eigenvector space of A. Then, $\mathbf {w} ^{T}\mathbf {Aw} =(\mathbf {U} \mathbf {z} )^{T}\mathbf {UDU} ^{T}(\mathbf {Uz} )=\mathbf {z} ^{T}\mathbf {U} ^{T}\mathbf {UDU} ^{T}\mathbf {Uz} =\mathbf {z} ^{T}\mathbf {Dz} \geq 0$ since $\mathbf {D}$ is a diagonal matrix with all values nonnegative.

#### Alternative Solution

$\mathbf {A} =\mathbf {U\Lambda U^{T}}$

Since all $\lambda _{i}\geq 0$, there exists $\mathbf {D}$, such that $\mathbf {DD^{T}=\Lambda }$

$\Rightarrow \mathbf {w^{T}Aw} =\mathbf {w^{T}U\Lambda U^{T}w} =\mathbf {w^{T}UDD^{T}U^{T}w} =\mathbf {(D^{T}U^{T}w)^{T}D^{T}U^{T}w} =\left\langle \mathbf {D^{T}U^{T}w} ,\mathbf {D^{T}U^{T}w} \right\rangle \geq 0$

## Question 2: Clustering

### a)

  - False (on the contrary, EM is much slower)
- False (soft assignments is a more general notion than GMM, which includes more than this particular mixture model)
- True (BIC scales logarithmically with the size of the data set, and AIC does not, so BIC > AIC for large enough N)


### b)

a: Since the two clusters on the top are close together, a stable clustering algorithm may put them together in one cluster, and will always put those two together, and never mix it with the bottom one. Thus, the result is stable, but consists of only 2 clusters. In b, the clusters have about the same distance, thus, a result with only 2 clusters cannot be stable (it will mix different clusters together, changing results every time).

### c)

#### 1

We have a component distribution, so we need latent variables for the probabilities that a certain document is drawn from a particular distribution. We define $z_{c}\in \{0,1\}$ with $\sum _{c=1}^{K}z_{c}=1$, and $P(z_{c}=1)=\pi _{c}$.

#### 2

By expectation of the latent variables, they mean given the observed data. Since $E[z_{c}|D]=p(z_{c}=1|D)$, we need to apply the formula

$p(A|B)={\frac {p(A)\cdot p(B|A)}{p(B)}}$

That is in our case:

$p(z_{c}=1|D)={\frac {p(z_{c}=1)\cdot p(D|z_{c}=1)}{p(D)}}={\frac {\pi _{c}\cdot f(D|\mu _{c})}{\sum _{j}\pi _{j}\cdot f(D|\mu _{j})}}$

This is analogous to the GMM case, but the mixture components are not Gaussians (but our special distribution).

#### 3

By calculating the parameters, they mean those that maximize the (log) likelihood. So we first rewrite the log-likelihood using our values for $p(z_{c}=1|D)$ from 2.

Then we differentiate the Lagrangian consisting of this log-likelihood and the conditions $\sum _{j}(\mu _{jc}=1)$, for which we need a lambda.

With $\gamma _{ic}=P(z_{c}=1|D_{i})$ we get $\mu _{jc}={\dfrac {-1}{\lambda _{c}}}\sum _{i}^{n}\gamma _{ic}F_{ij}$

And summing both sides over j: $\lambda _{c}=-\sum _{i}^{n}\gamma _{ic}\sum _{j}^{m}F_{ij}$

Therefore $\mu _{jc}={\dfrac {\sum _{i}^{n}\gamma _{ic}F_{ij}}{\sum _{i}^{n}\gamma _{ic}\sum _{j}^{m}F_{ij}}}$

#### 4

• Each ${\boldsymbol {\mu }}_{c}$ is a vector consisting of $m=100$ elements, each being a probability. Due to the hint in 3.) we know that $\forall c:\sum _{j}\mu _{c,j}=1$. Therefore, only 99 of the 100 elements is unknown/a variable (${\boldsymbol {\mu }}_{\text{last}}=1-\sum _{i\neq {\text{last}}}{\boldsymbol {\mu }}_{i}$).
• We have $K$ $\pi$'s. Due to $\sum _{k}\pi _{k}=1$ one $\pi$ is not a free variable, but can again be determined using the others. Further, $\pi _{1}=\pi _{2}$ so there is another one that is fixed. This leaves us with $K-1-1$ variable $\pi$'s.

Based on both observations we get $\kappa ({\boldsymbol {\theta }})=\underbrace {K\cdot (m-1)} _{{\text{number of unknown }}\mu _{c,j}}+\underbrace {(K-2)} _{{\text{number of unknown }}\pi _{k}}$

K = 3
AIC $=-\ln(p(\mathbf {X} |{\boldsymbol {\theta }}))+\kappa ({\boldsymbol {\theta }})=4000+3\cdot 99+(3-2)=4298$
BIC $=-\ln(p(\mathbf {X} |{\boldsymbol {\theta }}))+{\frac {1}{2}}\kappa ({\boldsymbol {\theta }})\ln(N)=4000+{\frac {1}{2}}\cdot (3\cdot 99+(3-2))\cdot \ln(2000)=4000+149\ln(2000)$
K = 5
AIC $=-\ln(p(\mathbf {X} |{\boldsymbol {\theta }}))+\kappa ({\boldsymbol {\theta }})=3500+5\cdot 99+(5-2)=3998$
BIC $=-\ln(p(\mathbf {X} |{\boldsymbol {\theta }}))+{\frac {1}{2}}\kappa ({\boldsymbol {\theta }})\ln(N)=3500+{\frac {1}{2}}\cdot (5\cdot 99+(5-2))\cdot \ln(2000)=3500+249\ln(2000)$

### d)

  1. $1-\beta _{d,k_{n}}=1-p(x_{dn}=1|\beta _{d,.},z_{.,n})$
1. (Alternative) $1-\prod _{k=1}^{K}\beta _{dk}^{z_{kn}}$ Explanation: The probability that at least 1 entry in u (of those marked by z) is 1.
2. $\prod _{n,d}P_{N}(x_{d_{n}}|r)^{(1-\xi _{dn})}P_{S}(x_{dn}|\beta _{d\cdot ,z\cdot n})^{\xi _{dn}}=\prod _{n,d}r^{x_{dn}(1-\xi _{dn})}(1-r)^{(1-x_{dn})(1-\xi _{dn})}P_{S}(x_{dn}|\beta _{d\cdot ,z\cdot n})^{\xi _{dn}}$


## Question 3: Sparse Coding

### a)

 1. ? $A^{-1}=A^{T}$, computing inverse transform is efficient.
2. Transformation is energy (norm) preserving, i.e. $\left|Ux\right|=\left|x\right|$


(I consider 1. to be the important point here, since the question is about a "computational issue". Inverting a matrix is usually inefficient and maybe even ill-conditioned, so this is an issue, while 2. isn't really)

### b)

  - Yes (U is invertible with U^-1 = U^T, so rank(U) = D)
- No (once again since U is invertible)
- Yes (Since the columns of U have norm 1, multiplication with a vector doesn't change its norm)
- No
- Yes


### c)

  - Yes (these should be sparse in some frequency domain)
- Yes (sparse in a component or eigendecomposition)
- No (this would mean compressing a random sequence, which is impossible)
- Yes (these follow patterns and are somewhat predictable, so not random)
- No (a good tennis player will try to be as unpredictable, i.e. random, as possible)


### d)

It was discussed in the exam revision session how to solve this type of exercise.

1.) Represent the signal as a vector, [4 2 2 5].

2.) Perform a basis transformation into the Haar Wavelet Basis by taking scalar products with the Haar basis function vectors [1 1 1 1], [1 1 -1 -1], [1 -1 0 0], [0 0 1 -1]. This gives: 13/4, -1/4, 1, -3/2.

3.) Remove the basis function with the smallest absolute value. This is [1 1 -1 -1] in this case.

#### Alternative Solution

I'm (.gregor) relatively sure the above solution is wrong. However, someone with expertise, I don't have any, should select the right one. The following statements are in line with G. Strang's Linear Algebra book and to the way basis transformation was done in the lecture. According to http://grail.cs.washington.edu/pub/stoll/wavelet1.pdf the method of just throwing away the smallest absolute coefficient works only if we use the orthonormal basis.

• The values of of the given vector $\mathbf {x}$ is $\mathbf {x} =[4,2,2,5]^{T}$ (read out the value of each section between the vertical lines).
• The matrix $\mathbf {W} =[\mathbf {w} _{1},\mathbf {w} _{2},\mathbf {w} _{3},\mathbf {w} _{4}]={\begin{pmatrix}0.5&0.5&{\frac {1}{\sqrt {2}}}&0\\0.5&0.5&-{\frac {1}{\sqrt {2}}}&0\\0.5&-0.5&0&{\frac {1}{\sqrt {2}}}\\0.5&-0.5&0&-{\frac {1}{\sqrt {2}}}\end{pmatrix}}$ contains the normalized wavelet basis vectors $\mathbf {w} _{i}$ and is the basis in which we want to represent our measurement $\mathbf {x}$
• To represent our measurement $\mathbf {x}$ in the basis $\mathbf {W}$ we need to calculate the coefficients $\mathbf {c} =\mathbf {W} ^{-1}\mathbf {x} =\mathbf {W} ^{T}\mathbf {x}$
• This results in $\mathbf {c} =\mathbf {W} ^{-1}\mathbf {x} ={\begin{pmatrix}6.5,-0.5,{\sqrt {2}},-{\frac {3}{2}}{\sqrt {2}}\end{pmatrix}}^{T}$
• The smallest absolute value in $\mathbf {c}$ is $-0.5$ and corresponds to the second basis, i.e. ${\begin{pmatrix}1&1&-1&-1\end{pmatrix}}^{T}$. This is the one we throw away. Again, we can only use the coefficient with smallest magnitude if we use an orthonormal basis.

My two cents, if you are still around: notice the orthonormal basis is just a linear scaling of the orthogonal basis used by the previous answer. Thus, the vector for which the smallest value (in magnitude) is achieved with scaling (in our case, the second column of $\mathbf {W}$ is the same as the one for which the smallest value is obtained without scaling. This is also consistent with the given hint: computing the actual coefficients is not necessary. Draw the sequence ${\begin{bmatrix}1&1&-1&-1\end{bmatrix}}$.

### e)

1. Yes

2. $B_{1}=B_{2}$, with ${\begin{pmatrix}0&1\end{pmatrix}},{\begin{pmatrix}1&0\end{pmatrix}}$ , then ${\begin{pmatrix}1/{\sqrt {2}}&1/{\sqrt {2}}\end{pmatrix}}$ and a fourth vector orthogonal to that.

3. $1/{\sqrt {2}}$

4. $1/{\sqrt {2}}$

5. The problem reduces to minimizing the cosine of the angle between the existing orthonormal vectors and the new vector (since it is normalized). This is achieved by setting the angle to 45 degrees, as otherwise one of the two angles between the new vector and an existing one has a larger cosine. Then, adding a fourth vector orthogonal to the third does not further increase the coherence. Since a global solution cannot do better than minimize for the first one, then not increase the coherence any more, the solutions for $B_{1}$ and $B_{2}$ are the same.

## Question 4: Robust PCA

### a)

1. We need to stack the frames as columns of a matrix X. It will be decomposed in X ~= L + S, where the background ends up in L, since it is not moving, so it is low-rank (as it does not change, so it is basically the same column repeated over and over) and the foreground ends up in S, since it is moving, but takes less space, so it is sparse.

2. The background has to take most part of the space, and it must not change much, so that the columns representing the background are the low-rank component. The foreground has to take not much space so it is sparse, but it must be moving and changing, so it is not low-rank. We need the guarantee that neither L nor S are both low-rank AND sparse. In particular, the coherence condition states that the Principal Components of L must not be sparse. This can be a problem if the background color/brightness differs too much within the video. For exact recovery, it must fulfill condition of the 'exact recovery theorem'. (Do we need to give the formulas?)

### b)

Convex.

First, show that "- b" does not matter:

$\max _{i}(t\mathbf {x} _{i}+(1-t)\mathbf {y} _{i}-b)=\max _{i}(t\mathbf {x} _{i}+(1-t)\mathbf {y} _{i})-b=t\max _{i}(\mathbf {x} _{i}-b)+(1-t)\max _{i}(\mathbf {y} _{i}-b)=t\max _{i}(\mathbf {x} _{i})-tb+(1-t)\max _{i}(\mathbf {y} _{i})-(1-t)b=t\max _{i}(\mathbf {x} _{i})+(1-t)\max _{i}(\mathbf {y} _{i})-b$

So it cancels out on each side of the inequality.

Now, we need to show that taking the maximum element of a vector is convex:

$max(t\cdot \mathbf {x} +(1-t)\cdot \mathbf {y} )\leq t\cdot max(\mathbf {x} )+(1-t)\cdot max(\mathbf {y} )$

Let $x'$ and $y'$ denote the vector entries that maximize $t\cdot x+(1-t)\cdot y$.

Then, the left side is $t\cdot x'+(1-t)\cdot y'$.

We have $t\cdot x'\leq t\cdot max(x)$ and $(1-t)\cdot y'\leq (1-t)\cdot max(y)$ for any x', y'.

Combining these inequalities, we get

$max(t\cdot x+(1-t)\cdot y)\leq t\cdot max(x)+(1-t)\cdot max(y)$ as desired.

#### Alternative Solution

We directly show that $f(tx+(1-t)y)\leq tf(x)+(1-t)f(y)$

$f(tx+(1-t)y)=\max _{i}((t\mathbf {x} +(1-t)\mathbf {y} )_{i}-b)=\max _{i}(t\mathbf {x} _{i}+(1-t)\mathbf {y} _{i}-b)=\max _{i}(t\mathbf {x} _{i}+(1-t)\mathbf {y} _{i})-b\leq t\max _{i}(\mathbf {x} _{i})+(1-t)\max _{i}(\mathbf {y} _{i})-b=t\max _{i}(\mathbf {x} _{i})+(1-t)\max _{i}(\mathbf {y} _{i})-b+tb-tb=t\max _{i}(\mathbf {x} _{i})-tb+(1-t)\max _{i}(\mathbf {y} _{i})-(1-t)b=t\max _{i}(\mathbf {x} _{i}-b)+(1-t)\max _{i}(\mathbf {y} _{i}-b)=tf(x)+(1-t)f(y)$

### c)

1. We still want to minimize $||L||_{(nuclear)}+\lambda \cdot ||S||_{1}$, but the constraint must only hold for the observed values, i.e., $X_{(i,j)}=L_{(i,j)}+S_{(i,j)}$ only for (i, j) observed.

2. In original Robust PCA, X = L + S must hold, i.e., for all values of X. When entries are missing, the values in L and S do not have constraints, so they can just minimize the main objective without following any constraint.

3. X is the original matrix. L is the low-rank component and S is the sparse component. The observed values are the ones for which we know the user ratings. lambda is a factor that determines how much a nonzero element of S costs us, i.e., it determines how much we care about sparseness of S.

### d)

This question corresponds pretty much to Problem 1 in Series 6 (SS 2015).

#### 1.

First of $\mathbf {A} \mathbf {x} \leq \mathbf {b} \Leftrightarrow \mathbf {A} \mathbf {x} -\mathbf {b} \leq {\boldsymbol {0}}$.

$L(\mathbf {x} ,{\boldsymbol {\lambda }})=\mathbf {c} ^{T}\mathbf {x} +{\boldsymbol {\lambda }}^{T}(\mathbf {Ax} -\mathbf {b} )=\mathbf {c} ^{T}\mathbf {x} +{\boldsymbol {\lambda }}^{T}\mathbf {Ax} -{\boldsymbol {\lambda }}^{T}\mathbf {b} =-{\boldsymbol {\lambda }}^{T}\mathbf {b} +(\mathbf {c} ^{T}+{\boldsymbol {\lambda }}^{T}\mathbf {A} )\mathbf {x} =-{\boldsymbol {\lambda }}^{T}\mathbf {b} +(\mathbf {c} +\mathbf {A} ^{T}{\boldsymbol {\lambda }})^{T}\mathbf {x}$

The dual function:

$g({\boldsymbol {\lambda }})=\inf _{x}L(\mathbf {x} ,{\boldsymbol {\lambda }})=\inf _{x}-{\boldsymbol {\lambda }}^{T}\mathbf {b} +(\mathbf {c} +\mathbf {A} ^{T}{\boldsymbol {\lambda }})^{T}\mathbf {x} =-{\boldsymbol {\lambda }}^{T}\mathbf {b} +\inf _{x}\left((\mathbf {c} +\mathbf {A} ^{T}{\boldsymbol {\lambda }})^{T}\mathbf {x} \right)$

#### 2.

Using the dual function found in the previous step we see that

$g({\boldsymbol {\lambda }})={\begin{cases}-{\boldsymbol {\lambda }}^{T}\mathbf {b} &{\text{if }}\mathbf {c} +\mathbf {A} ^{T}{\boldsymbol {\lambda }}=\mathbf {0} \\-\infty &{\text{otherwise}}\end{cases}}$

#### 3.

Therefore we want to maximize $g({\boldsymbol {\lambda }})$ subject to $\mathbf {c} +\mathbf {A} ^{T}{\boldsymbol {\lambda }}=\mathbf {0}$ and ${\boldsymbol {\lambda }}\geq \mathbf {0}$.

NOTE: The following steps are only valid if $\mathbf {A}$ is invertible. We do not know that! So we must assume $\mathbf {A}$ is not invertible.

$\mathbf {c} +\mathbf {A} ^{T}{\boldsymbol {\lambda }}=\mathbf {0} \Leftrightarrow \mathbf {A} ^{T}{\boldsymbol {\lambda }}=-\mathbf {c} \Leftrightarrow \quad {\text{iff we can invert }}\mathbf {A} {\boldsymbol {\lambda }}=-(\mathbf {A} ^{T})^{-1}\mathbf {c}$