Disclaimer: I am very unsure about these solutions, but I am still adding them so that there is something that can be corrected :D
1 Dimensionality reduction
1.1 PCA on an Ellipse
For the x coordinate: $\bar{x} = \frac{1}{2\pi}\int_0^{2\pi} a\cos t\,dt = 0$
(since cosine on the interval between 0 and $2\pi$ has a mean of 0)
For the y coordinate: $\bar{y} = \frac{1}{2\pi}\int_0^{2\pi} b\sin t\,dt = 0$
(since sine on the interval between 0 and $2\pi$ has a mean of 0)
Since the mean is 0, nothing has to be subtracted. So the covariance matrix will be
$\Sigma = \frac{1}{2\pi}\int_0^{2\pi} \begin{pmatrix} a^2\cos^2 t & ab\cos t\sin t \\ ab\cos t\sin t & b^2\sin^2 t \end{pmatrix} dt$
The off-diagonal entries vanish because $\int_0^{2\pi}\cos t\sin t\,dt = 0$, and the two diagonal integrals are indeed the same up to the constants: $\int_0^{2\pi}\cos^2 t\,dt = \int_0^{2\pi}\sin^2 t\,dt = \pi$.
The final result is
$\Sigma = \begin{pmatrix} a^2/2 & 0 \\ 0 & b^2/2 \end{pmatrix}$
Since we know that $a > b$, the first principal vector is $(1, 0)^\top$ with eigenvalue $a^2/2$.
The projection just keeps the x-coordinate, so a point $(a\cos t, b\sin t)$ is reconstructed as $(a\cos t, 0)$.
The per-point reconstruction error is the discarded y-coordinate, $b\sin t$.
Remark: I think the reconstruction error refers to the total error over all points, which is the sum of the discarded eigenvalues. That is $b^2/2$.
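A quick numerical sanity check of the PCA result for points on an axis-aligned ellipse (the semi-axes $a = 3$, $b = 1$ are made-up values, not from the exercise):

```python
import numpy as np

# Points (a cos t, b sin t) sampled uniformly on an ellipse;
# a = 3, b = 1 are illustrative values only.
a, b = 3.0, 1.0
t = np.linspace(0, 2 * np.pi, 1000, endpoint=False)
X = np.stack([a * np.cos(t), b * np.sin(t)], axis=1)

cov = X.T @ X / len(X)            # the mean is (numerically) zero
evals = np.linalg.eigvalsh(cov)   # ascending order
print(evals)  # → [0.5 4.5], i.e. [b^2/2, a^2/2]
```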
1.2 PCA in 3D
then rewrite this according to given formula.
The first principal component is
Its corresponding eigenvalue is
The easiest way to arrive at these values is to realize a priori that the points lie on a line, so the direction of that line must be the eigenvector corresponding to the only non-zero eigenvalue. Then you can just multiply the covariance matrix with any direction vector of the line and determine how much its length is scaled by the linear transformation. If you don't see this, you can compute the determinant of the covariance matrix with $\lambda$ subtracted on the diagonal and set it to zero, but surely nobody would ever do that... Who, me? Noo...
Normalize the eigenvector first and then multiply:
The reconstruction error is 0, since the data can be perfectly reconstructed (rank of covariance matrix is 1).
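Since the exercise's actual points aren't reproduced here, the following sketch uses a made-up line direction to illustrate the trick: for collinear, zero-mean data the covariance has rank 1, and multiplying it with a direction vector of the line reads off the eigenvalue.

```python
import numpy as np

# Made-up points on a line through the origin with unit direction v.
v = np.array([1.0, 2.0, 2.0]) / 3.0
ts = np.array([-2.0, -1.0, 1.0, 2.0])    # coordinates along the line (mean 0)
X = ts[:, None] * v[None, :]

cov = X.T @ X / len(X)                   # rank-1 covariance matrix
eigenvalue = (cov @ v) / v               # same scaling factor in every entry
print(np.linalg.matrix_rank(cov), eigenvalue)  # → 1 [2.5 2.5 2.5]
```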
1.3 SVD for Solving Linear Systems
We can multiply this by $U^\top$, since it is an orthogonal matrix and multiplying by it won't change the length.
The problem is convex, so any locally optimal solution will be the global optimum, thus it is unique. To minimize, take the derivative and set it to zero:
We can leave out all the dimensions that are set to zero in $\Sigma$, since they will be multiplied by 0 on both sides anyway. Let's take $\tilde{\Sigma}$ as the square (invertible) matrix that consists of the non-zero entries. We now have:
From (a) we know that we can instead optimize $\|\Sigma V^\top x - U^\top b\|$. If we set $y = V^\top x$ and $c = U^\top b$, then we get $\min_y \|\Sigma y - c\|$ and therefore
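The derivation above amounts to the SVD pseudoinverse solution $x = V\Sigma^{-1}U^\top b$ (full-rank case); a small sketch with random data, where the shapes are arbitrary assumptions:

```python
import numpy as np

# Least squares via the SVD: x = V Sigma^{-1} U^T b.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
b = rng.standard_normal(5)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
x = Vt.T @ ((U.T @ b) / s)

# agrees with NumPy's built-in least-squares solver
print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))  # → True
```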
1.4 SVD Properties
really not sure what they expect here?
so eigenvalues are
so eigenvalues are , which is the same as before.
A being a square symmetric matrix means that its eigenvectors are orthogonal.
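A numerical illustration of the relationship (using a random symmetric matrix as an assumption): the eigenvalues of $A^\top A$ are the squared singular values of $A$.

```python
import numpy as np

# For a square symmetric A: eigenvalues of A^T A = squared singular values of A.
rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
A = A + A.T                         # make A symmetric

sing = np.linalg.svd(A, compute_uv=False)
eig = np.linalg.eigvalsh(A.T @ A)
print(np.allclose(np.sort(sing**2), np.sort(eig)))  # → True
```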
2 Optimization, NMF, pLSA and Word Embedding
Proof by contradiction: Assume we have a local minimum $x^*$ that is not the global minimum, so there exists a $y$ with $f(y) < f(x^*)$, but also an $\epsilon > 0$ so that $f(x^*) \le f(x)$ for all $x$ with $\|x - x^*\| \le \epsilon$.
We know that our function is convex, so it must hold that $f(\lambda y + (1-\lambda)x^*) \le \lambda f(y) + (1-\lambda) f(x^*)$ for all $\lambda \in [0, 1]$.
Let's now choose $\lambda$ small enough so that $z = \lambda y + (1-\lambda)x^*$ ends up in the $\epsilon$-ball around $x^*$, so we know that $f(z) \ge f(x^*)$.
We also know, since $f(y) < f(x^*)$, that $\lambda f(y) + (1-\lambda) f(x^*) < f(x^*)$.
So we get $f(z) \ge f(x^*) > \lambda f(y) + (1-\lambda) f(x^*)$, which contradicts the equation stated in the beginning (convexity).
Given in the question we have the smoothness condition:
$f(y) \le f(x) + \nabla f(x)^\top (y - x) + \frac{L}{2}\|y - x\|^2$
And then we have the relation between $\eta$ and $L$: $\eta \le \frac{1}{L}$, so $\frac{L}{2} \le \frac{1}{2\eta}$.
Combined we see that:
$f(y) \le f(x) + \nabla f(x)^\top (y - x) + \frac{1}{2\eta}\|y - x\|^2$
And in particular, if we choose the point $x_{t+1}$ for the generic $y$, we get the following true expression which we will work towards later:
$f(x_{t+1}) \le f(x_t) + \nabla f(x_t)^\top (x_{t+1} - x_t) + \frac{1}{2\eta}\|x_{t+1} - x_t\|^2$
That line is our goal. Starting from the expression the question asks us to prove, we can show equivalence to the goal and thereby conclude the proof.
Expanding the squared two-norms
Replacing $x_{t+1}$ with $x_t - \eta\nabla f(x_t)$ using the gradient descent update step
Using linearity of the scalar product
Expanding the norm again
Using linearity of the scalar product
Moving the $\eta$ out
Dividing by $\eta$, which is positive, so the direction of the inequality doesn't flip
And that's our goal from above. QED
If $\nabla f(x_t) = 0$, then by convexity we know that $f(y) \ge f(x_t) + \nabla f(x_t)^\top (y - x_t) = f(x_t)$ for all $y$, so we must have reached the global optimum.
This is similar to (c), but since the function is not convex anymore, it is just a local optimum, not necessarily the global optimum.
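The descent guarantee from this subsection can be checked numerically on a smooth convex quadratic (the matrix and step size below are made-up; $\eta = 1/L$ with $L = \lambda_{\max}(Q)$):

```python
import numpy as np

# Gradient descent on f(x) = 0.5 x^T Q x; L = lambda_max(Q) = 3, eta = 1/L.
Q = np.array([[3.0, 0.0], [0.0, 1.0]])

def f(x):
    return 0.5 * x @ Q @ x

def grad(x):
    return Q @ x

eta = 1.0 / 3.0
x = np.array([1.0, -2.0])
for _ in range(50):
    g = grad(x)
    # the guarantee: f(x_{t+1}) <= f(x_t) - (eta/2) ||grad f(x_t)||^2
    assert f(x - eta * g) <= f(x) - 0.5 * eta * (g @ g) + 1e-12
    x = x - eta * g
print(f(x) < 1e-6)  # → True: converged toward the global optimum f(0) = 0
```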
2.2 pLSA and Word Embedding
1) how often word j occurs in document i.
2) to make sure that words that occur together more often have a larger weight, but also that the weight is capped at some point.
3) lower bound:
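The capped co-occurrence weighting mentioned above is, for instance, what GloVe uses; a sketch with GloVe's published defaults ($\alpha = 0.75$, $x_{\max} = 100$):

```python
# GloVe-style weighting: grows with the co-occurrence count x,
# but is capped at 1 once x reaches x_max.
def weight(x, x_max=100.0, alpha=0.75):
    return (x / x_max) ** alpha if x < x_max else 1.0

print(weight(1.0), weight(50.0), weight(1000.0))  # increasing, then capped at 1
```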
2.3 True/False Questions
Isn't it true? The squared Frobenius norm objective is convex in $U$ for fixed $V$ (and vice versa), just not jointly in $(U, V)$. The question says it is convex in one factor, not in both jointly.
False? We don't need step (b)?
3 Neural networks and clustering
3.1 Neural Networks
I'm not sure about the summation. Here's my solution
then by Cauchy-Schwarz
so if we choose and , we get the desired result.
It takes O(n) because we have to differentiate with respect to every weight.
1) extract interesting features from image
2) reduce size of image
Each round you go from n×n to (n-2)×(n-2). So we need (rounded up) (n-3)/2 iterations.
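A tiny helper checking this count (assuming the target is shrinking to at most 3×3, which is what matches the (n-3)/2 formula):

```python
import math

# Each valid 3x3 convolution maps n x n -> (n-2) x (n-2).
def conv_rounds(n, target=3):
    rounds = 0
    while n > target:
        n -= 2
        rounds += 1
    return rounds

for n in (9, 10, 27):
    print(n, conv_rounds(n), math.ceil((n - 3) / 2))  # the two counts agree
```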
3.2 True/False Questions
a) T b) F c) T d) F e) F
1) compute cluster centers:
2) assign data points:
3) compute cluster centers:
4) assign data points:
From now on, nothing will change anymore.
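The alternation above (assign points to the nearest center, recompute centers) can be sketched as follows; the data and initial centers are made-up, not the exercise's values:

```python
import numpy as np

# Minimal Lloyd's algorithm for k-means.
def kmeans(X, centers, iters=10):
    for _ in range(iters):
        # 1) assign each data point to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 2) recompute each center as the mean of its assigned points
        for k in range(len(centers)):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return centers, labels

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
centers, labels = kmeans(X, np.array([[0.0, 0.1], [4.0, 4.0]]))
print(labels)  # → [0 0 1 1]
```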
Intuitive approach, no idea if that might be true:
mean: $Kd$ parameters; covariance: $K\cdot\frac{d(d+1)}{2}$ parameters (for full, symmetric covariances); total: $Kd + K\cdot\frac{d(d+1)}{2}$
1. do we need to distinguish whether the covariance matrix is diagonal?
2. For the mixture weights $\pi_k$, do we still require $K-1$ parameters?)
3.4 K-means vs. Gaussian mixture model (GMM)
K-means: clusters are initialized uniformly at random, so all mixture weights are implicitly equal, $\pi_k = 1/K$.
In k-means clusters are spherical, so the covariances are $\sigma^2 I$. In GMM, clusters can be ellipsoids, so the covariance matrix does not need to be a multiple of the identity.
Two very stretched ellipses next to each other would be difficult for k-means, since it expects spheres. It should not be a problem for GMM though, since there the shape of the ellipsoid can be learned as well.
Each step is cheaper to compute in k-means. In addition to that, k-means needs fewer iterations.
4 Sparse Coding and Dictionary Learning
4.1 1-D Fourier transform
for signal b, since it has a lot of high frequencies and a has none.
Because then we have to also save the permutation, which will probably cost as much as saving the whole signal.
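A sketch of the compression argument: keep only the k largest Fourier coefficients and compare reconstruction errors for a smooth signal versus a noisy one (both signals are made up):

```python
import numpy as np

# Keep only the k Fourier coefficients with the largest magnitude.
def topk_fourier(x, k):
    F = np.fft.fft(x)
    keep = np.argsort(np.abs(F))[-k:]
    Fk = np.zeros_like(F)
    Fk[keep] = F[keep]
    return np.fft.ifft(Fk).real

n, k = 256, 8
t = np.linspace(0, 1, n, endpoint=False)
smooth = np.sin(2 * np.pi * 3 * t)                           # few frequencies
noisy = smooth + 0.5 * np.random.default_rng(0).standard_normal(n)

err_smooth = np.linalg.norm(smooth - topk_fourier(smooth, k))
err_noisy = np.linalg.norm(noisy - topk_fourier(noisy, k))
print(err_smooth < err_noisy)  # → True: high frequencies compress badly
```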
4.2 2-D Fourier transform
Image b, since it has fewer high frequencies than image a (which has the scarf and tablecloth).
Image a, since b has a lot of high frequencies in the original image which would be lost. Image a originally does not have many high frequencies (see (a)), so denoising will work well.
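The same idea in 2-D, as a low-pass denoising sketch (a synthetic "image"; a real photo would be loaded instead):

```python
import numpy as np

# Low-pass denoising: keep only a small window of low spatial frequencies.
rng = np.random.default_rng(0)
xx, yy = np.meshgrid(np.arange(64), np.arange(64))
img = np.sin(2 * np.pi * xx / 32.0)              # smooth synthetic image
noisy = img + 0.5 * rng.standard_normal(img.shape)

F = np.fft.fftshift(np.fft.fft2(noisy))
lowpass = (np.abs(xx - 32) < 4) & (np.abs(yy - 32) < 4)
denoised = np.fft.ifft2(np.fft.ifftshift(F * lowpass)).real

print(np.linalg.norm(denoised - img) < np.linalg.norm(noisy - img))  # → True
```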
4.3 sparse coding with an over-complete dictionary
Draw four vectors from the middle cluster to the outer clusters.
Each center corresponds to one atom and additionally one center corresponds to no atom at all (0,0).
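A sketch of this construction with made-up cluster positions: four atoms pointing at the four outer clusters, non-negative 1-sparse codes, and the middle cluster coded as all zeros:

```python
import numpy as np

# Dictionary: one unit atom per outer cluster (columns of D).
D = np.array([[1.0, -1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, -1.0]])          # shape (2, 4)
centers = np.array([[0.0, 0.0], [2.0, 0.0], [-2.0, 0.0],
                    [0.0, 2.0], [0.0, -2.0]])  # middle + 4 outer clusters

errors = []
for c in centers:
    scores = D.T @ c
    code = np.zeros(4)
    j = scores.argmax()
    if scores[j] > 0:                          # non-negative 1-sparse code
        code[j] = scores[j]
    errors.append(np.linalg.norm(D @ code - c))
print(max(errors))  # → 0.0: each center is exactly one atom (or none)
```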
4.4 Compressed Sensing
No, there are no non-zero elements.
Yes it is, since only the one counterfeit coin will give a non-zero element.
for some constant c
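The one-measurement idea behind such bounds can be illustrated with the classic coin puzzle (all numbers below are made up): take i coins from stack i and weigh them all together; the deviation from the expected weight identifies the counterfeit stack.

```python
import numpy as np

# n stacks of coins; one counterfeit stack is heavier by delta per coin.
n, true_weight, delta = 10, 10.0, 1.0
counterfeit = 6                                # made-up counterfeit stack index

per_coin = np.full(n, true_weight)
per_coin[counterfeit] += delta

taken = np.arange(1, n + 1)                    # take i coins from stack i
measured = taken @ per_coin                    # a single linear measurement
expected = taken.sum() * true_weight
recovered = int(round((measured - expected) / delta)) - 1
print(recovered)  # → 6, the counterfeit stack
```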