# Parallel Computing HS17: Suggested Solutions

## Speedup, Amdahl, Gustafson
1. Exercise
• Hardware is imperfect
• Code is not 100% parallelizable
1. Gustafson: $S_{p}=p-f(p-1)\Rightarrow 3=4-f(4-1)=4-3f\Rightarrow 3f=1\Rightarrow f={\frac {1}{3}}$
2. Amdahl: $S_{p}={\frac {1}{f+{\frac {1-f}{p}}}}\Rightarrow 3={\frac {1}{f+{\frac {1-f}{4}}}}={\frac {4}{3f+1}}\Rightarrow 9f+3=4\Rightarrow f={\frac {1}{9}}$
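Both fractions can be sanity-checked numerically; a minimal sketch, with method names of our own:

```java
public class SpeedupLaws {
    // Amdahl's law: speedup on p processors with sequential fraction f
    static double amdahl(double f, int p) {
        return 1.0 / (f + (1.0 - f) / p);
    }

    // Gustafson's law: scaled speedup on p processors with sequential fraction f
    static double gustafson(double f, int p) {
        return p - f * (p - 1);
    }

    public static void main(String[] args) {
        System.out.println(amdahl(1.0 / 9, 4));    // ≈ 3.0
        System.out.println(gustafson(1.0 / 3, 4)); // ≈ 3.0
    }
}
```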

## Pipelining

2. Exercise
1. $\infty$. The pipeline is unbalanced regardless of the duration of sharpen, since compress takes longer than trim.
2. ${\frac {1}{100\,ms}}={\frac {1}{100\cdot 10^{-3}\,s}}=10\ img/s$. Therefore sharpen takes 100 ms.
3. Latency is unbounded. Alternatively:
• Latency in Round 1: 260ms
• Latency in Round 2: 320ms
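The throughput argument can be made explicit: in steady state, a pipeline's throughput is the reciprocal of its slowest stage. A minimal sketch (all stage durations except sharpen's 100 ms are made-up placeholders):

```java
public class PipelineThroughput {
    // steady-state throughput in images/s, given stage durations in ms
    static double throughputPerSecond(double[] stageDurationsMs) {
        double slowest = 0;
        for (double d : stageDurationsMs) {
            slowest = Math.max(slowest, d); // the slowest stage dictates the rate
        }
        return 1000.0 / slowest;
    }

    public static void main(String[] args) {
        // trim, sharpen, compress (trim and compress values are hypothetical)
        System.out.println(throughputPerSecond(new double[]{40, 100, 80})); // 10.0
    }
}
```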

3. Exercise
1. 5, since we count five nodes (each node is a task with runtime 1 according to the exercise) starting at $x_{1}$.
2. 7, since we could compute all the variables and the node "sum" in parallel.
3. 4, since at its broadest point the task graph has a width of 4. With fewer processors either $x_{3}$ or $y_{3}$ will not be ready in time for the summation and we get a longer runtime than the optimum of 4.
4. The critical path has length 5. An execution time of 5 can be reached with 4 processors. The critical path is determined by the dependencies in the data flow; in our case the sequential summation is the bottleneck, since it is not parallelizable (in this task graph).
4. Exercise
1. Instead of calculating the sums one after another, we can calculate (sum + (x_1+y_1)) and ((x_2+y_2) + (x_3+y_3)) in parallel and then add the two partial sums. This shortens the critical path from 5 to 4.

6. Exercise
1.
```java
public class BankAccount {
    private double amount = 0;

    public synchronized void deposit(double amount) {
        if (amount < 0) {
            throw new UnsupportedOperationException("Unable to deposit negative amount.");
        }

        this.amount += amount;
        this.notifyAll(); // wake up waiting threads to check whether the balance suffices for their withdrawal
    }

    public synchronized void withdraw(double amount) {
        // wait until there is sufficient money in the account
        while (this.amount < amount) {
            try {
                this.wait();
            } catch (InterruptedException e) {}
        }

        this.amount -= amount;
        this.notifyAll();
    }

    public synchronized double getBalance() {
        return this.amount;
    }
}
```

2. Generally, yes. If getBalance() is called externally with a different lock or no lock at all, we could get a data race between the external state and the object state.
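A minimal usage sketch of the wait/notifyAll pattern above (class name and amounts are our own): one thread blocks in withdraw until another thread's deposit makes the balance sufficient.

```java
public class AccountDemo {
    static class Account {
        private double amount = 0;

        synchronized void deposit(double a) {
            amount += a;
            notifyAll(); // wake up threads waiting in withdraw
        }

        synchronized void withdraw(double a) throws InterruptedException {
            while (amount < a) {
                wait(); // block until the balance suffices
            }
            amount -= a;
        }

        synchronized double balance() {
            return amount;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Account account = new Account();
        Thread withdrawer = new Thread(() -> {
            try {
                account.withdraw(50); // blocks: balance is still 0
            } catch (InterruptedException e) {}
        });
        withdrawer.start();
        Thread.sleep(100);    // give the withdrawer time to start waiting
        account.deposit(100); // wakes the withdrawer up
        withdrawer.join();
        System.out.println(account.balance()); // 50.0
    }
}
```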

## Fork / Join Framework

7. Exercise
1.
```java
import java.util.concurrent.RecursiveTask;

public class MaxTripleTask extends RecursiveTask<Double> {
    int start;
    int length;
    double[] input;

    public MaxTripleTask(double[] input, int start, int length) {
        this.start = start;
        this.length = length;
        this.input = input;
    }

    @Override
    protected Double compute() {
        if (length == 1) {
            // border elements have no complete triple centred on them
            if (start == 0 || start == input.length - 1)
                return 0.0;
            else
                return input[start - 1] + input[start] + input[start + 1];
        }

        int l1 = length / 2;
        int l2 = length - l1;

        MaxTripleTask left = new MaxTripleTask(input, start, l1);
        MaxTripleTask right = new MaxTripleTask(input, start + l1, l2);

        left.fork();

        return Math.max(right.compute(), left.join());
    }
}
```


or

```java
import java.util.concurrent.RecursiveTask;

public class MaxTripleTask extends RecursiveTask<Double> {
    int start;
    int length;
    double[] input;

    public MaxTripleTask(double[] input, int start, int length) {
        this.start = start;
        this.length = length;
        this.input = input;
    }

    @Override
    protected Double compute() {
        // cutoff: below the threshold, same as the sequential version
        if (length <= 10) {
            double result = 0;

            for (int i = start; i < start + length - 2; i++) {
                result = Math.max(result, triple_starting_at_pos(input, i));
            }

            return result;
        } else {
            int l1 = length / 2;
            int l2 = length - l1;

            MaxTripleTask f1 = new MaxTripleTask(input, start, l1);
            MaxTripleTask f2 = new MaxTripleTask(input, start + l1, l2);

            f2.fork();
            f1.fork();

            // also check the two triples that span the border between the two parts
            double tripleOverMiddle = Math.max(triple_starting_at_pos(input, start + l1 - 1),
                                               triple_starting_at_pos(input, start + l1 - 2));

            return Math.max(tripleOverMiddle, Math.max(f1.join(), f2.join()));
        }
    }

    static double triple_starting_at_pos(double[] input, int pos) {
        return input[pos] + input[pos + 1] + input[pos + 2];
    }
}
```

2. Unit testing against a correct sequential version of the program.
3. For c) 100, because with fewer than 100-500 operations per task (the optimal threshold according to the lecture) the overhead produced by the parallelism overwhelms the single-thread computation time.
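The effect of such a cutoff can be illustrated on a simpler task; a minimal sketch of an array sum with the lecture's ~100-element threshold (class and names are our own):

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

public class SumTask extends RecursiveTask<Long> {
    static final int CUTOFF = 100; // below this, task overhead outweighs the parallel gain

    final long[] a;
    final int start, length;

    SumTask(long[] a, int start, int length) {
        this.a = a;
        this.start = start;
        this.length = length;
    }

    @Override
    protected Long compute() {
        if (length <= CUTOFF) { // sequential base case
            long sum = 0;
            for (int i = start; i < start + length; i++) {
                sum += a[i];
            }
            return sum;
        }
        int half = length / 2;
        SumTask left = new SumTask(a, start, half);
        SumTask right = new SumTask(a, start + half, length - half);
        left.fork();                          // run the left half asynchronously
        return right.compute() + left.join(); // compute the right half ourselves
    }

    public static long sum(long[] a) {
        return ForkJoinPool.commonPool().invoke(new SumTask(a, 0, a.length));
    }
}
```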

## Progress Condition

8. Exercise
• Wrong. It has locks, and therefore can't be wait-free.
• Correct
• Wrong. It has locks.
• Correct
1.
```java
public class Position {
    // --- Begin code section 0
    AtomicReference<Data> currentData = new AtomicReference<Data>(new Data(-1, -1));
    // --- End code section 0

    public void update(Data data) {
        // --- Begin code section 1
        Data oldData = currentData.get();

        // retry as long as our data is newer but another thread beat us to the CAS
        while (oldData.getTime() < data.getTime() && !currentData.compareAndSet(oldData, data)) {
            oldData = currentData.get();
        }
        // --- End code section 1
    }

    public Data get() {
        // --- Begin code section 2
        return currentData.get();
        // --- End code section 2
    }
}
```
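A minimal sketch of how the CAS retry loop behaves, with a simplified Data holder (class and field names are our own): a stale update (older timestamp) is discarded, a newer one wins even under contention.

```java
import java.util.concurrent.atomic.AtomicReference;

public class TimedPosition {
    static final class Data {
        final int time;
        final double value;

        Data(int time, double value) {
            this.time = time;
            this.value = value;
        }
    }

    final AtomicReference<Data> currentData = new AtomicReference<Data>(new Data(-1, -1));

    // lock-free update: only replace the current value if ours is newer
    public void update(Data data) {
        Data oldData = currentData.get();
        while (oldData.time < data.time && !currentData.compareAndSet(oldData, data)) {
            oldData = currentData.get(); // CAS failed: someone else updated, re-read and retry
        }
    }
}
```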


## Critical Sections

9. Exercise
2. [nice picture needed]
3. Deadlock-freedom is violated on lines 3 and 8.

## Linearisability and Sequential Consistency

10. Exercise
1. H1
```
     A x.enqueue(5)  *---------(5)-------*
B x.enqueue(7)    *-(7)-*
B x:void                <>
B x.dequeue()               *-()-*
B x:5                            <>
A x:void                            <>
```


H2

```
     A x.enqueue(5)  *-(5)-*
A x:void              <>
A x.dequeue()            *-------()--*
B x.enqueue(7)             *-(7)-*
B x:void                         <>
A x:7                                <>
```

2. Only 2 can be dequeued, since it must hold that B x.enqueue(1) happened before A x.dequeue(1) such that B x.dequeue() returns 2.
```
     A x.enqueue(0)  *-(0)-*
->  B x.enqueue(1)    *-------(1)-------------*
A x:void              <>
->  A x.dequeue()            *-()-*
->  A x:1                         <>
A x.enqueue(2)                   *-(2)-*
A x:void                               <>
B x:void                                  <>   (but the insertion must have happened well before, since A was already able to dequeue 1)
B x.dequeue()                                *-()-*
B x:???                                           <>
```


## Sorting Networks

11. Exercise
1. No, a sorting network always performs the same number of comparisons, regardless of the input.
2. If a sorting network sorts all possible inputs of 0s and 1s, it sorts all possible inputs (the 0-1 principle).
3. Brute force:
• Write down all possible inputs of 0s and 1s.
• Ignore those that are already sorted to begin with.
• Run all others through the sorting network and check whether they are sorted at the end. If a sequence is sorted after some step, there is no need to continue, since it cannot go back to an unsorted state.
• If all inputs end up sorted, you have proven that the sorting network works. Otherwise you have a counter-example proving that it doesn't.
4. Counter-example: ${\begin{bmatrix}1\\1\\0\end{bmatrix}}$
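The brute-force procedure from 3. can be written out for small networks; a minimal sketch where a network is a list of comparator wire pairs (the concrete 3-wire networks are our own examples):

```java
public class ZeroOneCheck {
    // checks whether the comparator network sorts every 0/1 input of length n
    static boolean sortsAllZeroOneInputs(int[][] comparators, int n) {
        for (int bits = 0; bits < (1 << n); bits++) {
            int[] v = new int[n];
            for (int i = 0; i < n; i++) {
                v[i] = (bits >> i) & 1; // enumerate all 0/1 inputs
            }
            for (int[] c : comparators) { // run the input through the network
                if (v[c[0]] > v[c[1]]) {
                    int tmp = v[c[0]];
                    v[c[0]] = v[c[1]];
                    v[c[1]] = tmp;
                }
            }
            for (int i = 1; i < n; i++) {
                if (v[i - 1] > v[i]) {
                    return false; // counter-example found
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // a correct 3-wire sorter
        System.out.println(sortsAllZeroOneInputs(new int[][]{{0, 1}, {1, 2}, {0, 1}}, 3)); // true
        // dropping the last comparator fails, e.g. on the input 1,1,0
        System.out.println(sortsAllZeroOneInputs(new int[][]{{0, 1}, {1, 2}}, 3)); // false
    }
}
```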

## MPI

12. Exercise
1.
```java
public static void Transfer(double x[], int length) {
    // ...we assume that MPI.Init has already been called...
    int mpiRank = MPI.COMM_WORLD.Rank();

    double[] copy = new double[length];

    if (mpiRank == 0) {
        MPI.COMM_WORLD.Send(x, 0, length, MPI.DOUBLE, 1, 0);
    } else if (mpiRank == 1) {
        MPI.COMM_WORLD.Recv(copy, 0, length, MPI.DOUBLE, 0, 0);
    }
    // ...we assume that MPI.Finalize will be called at a later stage...
}
```

2.
```java
public static void MpiMatrixVector(float A[], float x[], float y[], int rows, int cols) {
    // ...we assume that MPI.Init has already been called...
    int mpiRank = MPI.COMM_WORLD.Rank();
    int mpiSize = MPI.COMM_WORLD.Size();
    // You can assume that rows is divisible by mpiSize
    int rowsLocal = rows / mpiSize;
    // ---- Begin code section 0
    MPI.COMM_WORLD.Bcast(x, 0, x.length, MPI.FLOAT, 0);
    // ---- End code section 0
    float ALocal[] = new float[cols * rowsLocal];
    // ---- Begin code section 1
    MPI.COMM_WORLD.Scatter(A, 0, cols * rowsLocal, MPI.FLOAT, ALocal, 0, cols * rowsLocal, MPI.FLOAT, 0);
    // ---- End code section 1
    float yLocal[] = new float[rowsLocal];
    for (int i = 0; i < rowsLocal; ++i) {
        yLocal[i] = 0;
        for (int j = 0; j < cols; ++j) {
            yLocal[i] += ALocal[i * cols + j] * x[j];
        }
    }
    // ---- Begin code section 2
    MPI.COMM_WORLD.Gather(yLocal, 0, rowsLocal, MPI.FLOAT, y, 0, rowsLocal, MPI.FLOAT, 0);
    // ---- End code section 2
    // ...we assume that MPI.Finalize will be called at a later stage...
}
```


## Locks and Synchronization

13. Exercise
• Correct, because a wait-free implementation must be non-blocking and locks are inherently blocking.
• Wrong. It is "deadlock-free + fair implies starvation-free".
• Wrong. Every thread that attempts to take the lock eventually succeeds. (see The Art of Multiprocessor Programming, p. 24)
• Wrong, since the ABA problem can occur when using CAS.
• Wrong, because the semantics of semaphores state that a semaphore has a capacity $c$ and lets up to $c$ threads into the critical section.
• Correct, since locks guarantee mutual exclusion on the critical section.
• Correct. In Java it can be implemented with a semaphore of capacity 1: Semaphore mutex = new Semaphore(1);
• Correct, since a total order on the locks prevents deadlocks, because circular references in the dependency graph are then impossible.
1. If the dependency graph has a circular dependency, a deadlock can occur. In the case of the philosophers, all philosophers acquire the fork to their right, so that no one can pick up their left fork, since it is already in use.
• Wrong.
• Wrong, since the random order could happen to be exactly an order that causes a deadlock.
• Correct.
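The total-order argument can be sketched in code: if every thread acquires its two forks in global index order, a cycle in the wait-for graph is impossible (helper names are our own).

```java
import java.util.concurrent.locks.ReentrantLock;

public class OrderedForks {
    // acquire two indexed locks in global (index) order, preventing circular waits
    static void pickUp(ReentrantLock[] forks, int left, int right) {
        int first = Math.min(left, right);
        int second = Math.max(left, right);
        forks[first].lock();  // everyone locks the lower index first...
        forks[second].lock(); // ...so no cycle of threads can wait on each other
    }

    static void putDown(ReentrantLock[] forks, int left, int right) {
        forks[Math.max(left, right)].unlock(); // release in reverse order
        forks[Math.min(left, right)].unlock();
    }
}
```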
2.
```java
public class FIFO {
    private final int capacity = 10; // capacity of the queue
    private int count = 0; // current size of the queue
    private final LinkedList<Integer> queue = new LinkedList<Integer>();

    private final Lock lock = new ReentrantLock();

    private final Condition notFull = lock.newCondition();
    private final Condition notEmpty = lock.newCondition();

    public void enqueue(int x) {
        // --- Begin code section 0
        lock.lock();
        while (count == capacity) {
            try {
                notFull.await();
            } catch (InterruptedException e) {}
        }
        queue.addLast(x);
        count++;
        notEmpty.signalAll();
        lock.unlock();
        // --- End code section 0
    }

    public int dequeue() {
        // --- Begin code section 1
        lock.lock();
        while (count == 0) {
            try {
                notEmpty.await();
            } catch (InterruptedException e) {}
        }
        int elem = queue.removeFirst();
        count--;
        notFull.signalAll();
        lock.unlock();
        return elem;
        // --- End code section 1
    }
}
```