The idea is to differentiate by :
(we differentiate according to: http://fourier.eng.hmc.edu/e161/lectures/algebra/node7.html )
- We first sample the cluster index according to the mixture probabilities
- Then we sample the data point from the selected component's distribution
function x = sample_gmm(pis, mus, sigma)
  % Step 1: sample the cluster index according to the mixture weights
  t = rand();
  s = 0;
  for i = 1:length(pis)
    s = s + pis(i);
    if t <= s
      % Step 2: sample the data point from the selected Gaussian
      x = rand_gaussian(mus(i), sigma);
      return
    end
  end
end

% Equivalent vectorized version:
k = find(cumsum(pis) >= rand(1), 1);
x = rand_gaussian(mus(k), sigma);
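The same two-step sampling can be sketched in Python/NumPy (the `rand_gaussian` helper is replaced by NumPy's normal sampler; names are illustrative):

```python
import numpy as np

def sample_gmm(pis, mus, sigma, rng):
    """One draw from a Gaussian mixture with weights pis, means mus
    and shared standard deviation sigma."""
    # Step 1: sample the cluster index via the cumulative weights
    k = np.searchsorted(np.cumsum(pis), rng.random())
    # Step 2: sample the data point from the selected Gaussian
    return rng.normal(mus[k], sigma)
```

`np.searchsorted` on the cumulative weights is the vectorized counterpart of the `find(cumsum(pis) >= rand(1), 1)` line above.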
Where is if otherwise .
The task is to "[...] not compute the values of all these events, but only give a symbolic expression." Accordingly I'd write
blue +: Gaussian Mixture Model
red x: K-Means
Reason: K-means is more sensitive to outliers. Additionally, k-means can use only the points assigned entirely to a cluster for centroid estimation: many of the nearby points belong to the other cluster, so the centroids get pushed outwards.
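A minimal numeric illustration of this effect (the data values and the fixed centre estimate are made up for the example, and only one cluster's update step is shown, not full EM):

```python
import numpy as np

pts = np.array([0.0, 1.0, 2.0, 10.0])   # 10.0 is an outlier

# K-means: the outlier is assigned entirely to this cluster,
# so it enters the centroid update with full weight
hard_centroid = pts.mean()               # pulled towards the outlier

# GMM: responsibilities downweight the outlier (here computed from a
# unit-variance Gaussian centred at the current estimate mu = 1.0)
mu = 1.0
resp = np.exp(-0.5 * (pts - mu) ** 2)    # unnormalised responsibilities
soft_centroid = (resp * pts).sum() / resp.sum()
```

The hard centroid lands at 3.25, while the responsibility-weighted centroid stays near 1.0.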
2. Dimensionality Reduction
Singular Value Decomposition
Note: according to the notation used in the exam (last page)
Choose k = 3 because the first three singular values are very strong while the others are very weak.
In U: box out the left 3 columns
In V^T: box out the upper 3 rows
In Σ: box out the upper-left 3x3 submatrix
Approximation error under the Euclidean (spectral) matrix norm for the rank-3 approximation is the value of the first discarded singular value, σ_4.
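This can be checked numerically with NumPy on a random matrix (the sizes here are arbitrary): the spectral-norm error of the rank-3 truncation equals the fourth singular value.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 7))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 3
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # rank-3 approximation

err = np.linalg.norm(A - A_k, 2)   # spectral (Euclidean) matrix norm
# err equals the first discarded singular value s[3]
```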
⇒ user 5 likes item 3
2nd and 5th items are identical. They have the same vector in V, indicating that they have the same characteristics (concepts).
The different values in their 7th component are irrelevant because the corresponding singular value is (practically) zero. These values vanish anyway by multiplication with Σ.
Principal Component Analysis
Σ is the covariance matrix of X, i.e. Σ = (1/n) X X^T (resp. (1/(n-1)) X X^T for the unbiased estimate), assuming the centered data points are the columns of X.
(Note: previously it was stated that this is X^T X, which is not true, as nowhere is it mentioned that X is square)
Given the eigenvalue decomposition Σ = U Λ U^T of the covariance matrix:
Eigenvectors are the columns of U
Eigenvalues are the diagonal elements of Λ
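A short NumPy sketch of this recipe (random data, 1/n normalization assumed):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 200))        # 5 features, 200 data points as columns
Xc = X - X.mean(axis=1, keepdims=True)   # centre each feature

n = X.shape[1]
cov = (Xc @ Xc.T) / n                    # covariance matrix, 1/n convention

lam, U = np.linalg.eigh(cov)             # eigenvalues ascending, eigenvectors as columns
pc1 = U[:, -1]                           # first principal component: largest eigenvalue
```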
3. Role Mining
Fraction of wrongly reconstructed permissions. Low=good, High=bad
High GE: Many wrongly predicted permissions -> bad.
We could not predict the permissions well based on a subset of all permissions.
Low GE: Good
Same formula, but:
Deviation is based on the reconstruction of the original data set.
GE is based on the reconstruction of a new data set, where the reconstruction is based only on minimizing the deviation on a subset of permissions (prediction).
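A toy sketch of the "fraction of wrongly reconstructed permissions" idea, with a made-up 3-user, 4-permission matrix and a hypothetical two-role Boolean decomposition (all values invented for illustration):

```python
import numpy as np

# Hypothetical users x permissions matrix (1 = permission granted)
X = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 0, 1, 1]])

# Hypothetical role assignment Z (users x roles) and role definitions
# R (roles x permissions); reconstruction is the Boolean matrix product
Z = np.array([[1, 0],
              [1, 1],
              [0, 1]])
R = np.array([[1, 1, 0, 0],
              [0, 0, 1, 1]])
X_hat = (Z @ R) > 0               # Boolean OR over the assigned roles

frac_wrong = np.mean(X != X_hat)  # fraction of wrongly reconstructed permissions
```

Here user 2's fourth permission is wrongly reconstructed, giving 1 error out of 12 entries.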
Quote from slides:
X is typically noisy, i.e., it contains permission assignments without an underlying role!
Therefore "noisy" means having permissions without an underlying role, so a GE of 0 should not be possible.
4. Deterministic Annealing
a) Correct order of code lines: 7 5 3 1 4 6 2
Note: This seems to be the EM Algorithm for Deterministic Annealing. EM Algorithms usually (the ones presented in 2015) perform first the E-step, then the M-step. Hence, I'd expect to see first line 3 followed by line 1. However, in 2015 this algorithm was not part of the lecture. Would be great if someone finds a source confirming one of both orders (.gregor)
Note: The order was wrong, it is first 3 then 1. See http://www.inf.ethz.ch/personal/alexeygr/slt15/lectures/slt15_lecture03.pdf page 15 and 17.
b) Quote by Kenneth Rose (p.6-7):
Deterministic annealing can be intuitively understood as follows. Instead of making noisy moves on the given energy surface, we "incorporate" the noise into the energy function. We derive a sequence of effective energy functions that are parametrized by the temperature, the level of the noise. These effective cost functions will be very smooth at low β, where even very large barriers can be easily traversed, and as β is increased they become more ragged and converge to the original energy at β → ∞. In fact, at β = 0 the cost function will usually be convex, so that the global minimum of this function is easily obtained. Thus the deterministic annealing approach can be viewed as locating the global minimum at β = 0, and tracking the minimum while gradually increasing β, by using conventional convex optimization methods at each temperature.
Note: Whereas the quote is correct, notice that it talks about a β parameter, not T. We have T = 1/β. Therefore, the argument in terms of T is the other way around (i.e. the surface is smoother for high T and more ragged at low T).
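The smoothing effect described in the quote can be illustrated with the free energy F_β = -(1/β) log Σ_i exp(-β E_i), a "soft minimum" of candidate energies that hardens as β grows (the two energy values below are arbitrary):

```python
import numpy as np

def free_energy(E, beta):
    """Effective energy F_beta = -(1/beta) * log(sum_i exp(-beta * E_i)).
    A smooth soft-minimum of the energies E; converges to min(E) as beta grows."""
    E = np.asarray(E, dtype=float)
    m = E.min()
    # log-sum-exp with a shift for numerical stability
    return m - np.log(np.exp(-beta * (E - m)).sum()) / beta

E = [1.0, 3.0]
# small beta (high T): heavily smoothed, barriers between states washed out
# large beta (low T): converges to the hard minimum min(E) = 1.0
```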
5. Sparse Coding
Two vectors in an inner product space V are orthogonal if
their inner product is zero.
An orthogonal matrix is a square matrix with real entries whose columns (and rows) are orthonormal.
Let U = [u_1, ..., u_n], i.e. u_1, ..., u_n are the column vectors of U.
U is orthogonal if
- its column vectors are orthonormal: u_i^T u_j = δ_ij
- it is a square matrix
The exam requires three properties. I'd say:
- Columns are pairwise orthogonal (case i ≠ j: u_i^T u_j = 0)
- Columns are unit vectors (case i = j: u_i^T u_i = 1)
- Matrix is square
Reasons why each step holds in ||Ux||_2^2 = (Ux)^T (Ux) = x^T U^T U x = x^T x = ||x||_2^2:
- Definition of the Euclidean norm
- Property of the transpose: (Ux)^T = x^T U^T
- U^T U = I, then the definition of the Euclidean norm again
We have shown that:
Applying an orthogonal matrix U to a vector x preserves the Euclidean length: ||Ux||_2 = ||x||_2.
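A quick numerical check of both facts, using a 2x2 rotation matrix as the orthogonal U (the angle is arbitrary):

```python
import numpy as np

theta = 0.7
U = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # rotation matrices are orthogonal

x = np.array([3.0, 4.0])

# U^T U = I  and  ||U x||_2 = ||x||_2
same_norm = np.isclose(np.linalg.norm(U @ x), np.linalg.norm(x))
```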
Coding: transform x into Haar basis
Thresholding: cut off small components
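A sketch of coding and thresholding with an assumed orthonormal 4-point Haar basis (the signal and threshold are made up for illustration):

```python
import numpy as np

# Orthonormal Haar basis of R^4 as columns of U
U = np.array([
    [0.5,  0.5,  1/np.sqrt(2),  0.0],
    [0.5,  0.5, -1/np.sqrt(2),  0.0],
    [0.5, -0.5,  0.0,           1/np.sqrt(2)],
    [0.5, -0.5,  0.0,          -1/np.sqrt(2)],
])

x = np.array([4.0, 4.0, 1.0, 1.1])

c = U.T @ x                                  # coding: transform x into the Haar basis
c_thr = np.where(np.abs(c) > 0.5, c, 0.0)    # thresholding: cut off small components
x_hat = U @ c_thr                            # reconstruction from the kept components
```

Because the basis is orthonormal, the reconstruction error equals the norm of the discarded coefficients.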
The signal is the sum of the normalized basis vectors. The length of the sum of all n orthonormal basis vectors is ||u_1 + ... + u_n||_2 = sqrt(n) (here sqrt(4) = 2), since all cross terms u_i^T u_j vanish.
This results in
Reason: The signal has to be represented with all 4 basis vectors. If we would represent this signal with 3 basis vectors, then the reconstruction error would increase.
[~, cur] = max(abs(U' * res));   % pick the atom most correlated with the residual
coef = U(:, cur)' * res;         % coefficient of the selected atom
res = res - coef * U(:, cur);    % update the residual
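The full matching-pursuit loop that these lines implement can be sketched in NumPy (function name and interface are illustrative; U is assumed to have unit-norm columns):

```python
import numpy as np

def matching_pursuit(U, x, n_iter):
    """Greedy sparse coding: repeatedly pick the column of U most
    correlated with the residual and subtract its contribution."""
    res = x.astype(float).copy()
    approx = np.zeros_like(res)
    for _ in range(n_iter):
        cur = np.argmax(np.abs(U.T @ res))   # best-matching atom
        coef = U[:, cur] @ res               # its coefficient
        approx += coef * U[:, cur]
        res -= coef * U[:, cur]
    return approx, res
```

At every step it holds that approx + res equals the original vector, which is the invariant used in the note on the residual below.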
As the question states , we only need to do 2 iterations.
Note that in the picture the residual points in the wrong direction: the residual must always point towards the real value, since it holds that approx + residual = real vector.