Just an idea for a spotlight this evening
Parallel Computing
Parallelism in the hardware
- Bit-level parallelism; for example, parallel adders
- Pipelines: overlap the stages of the fetch/decode/execute cycle
- Coprocessors: special-purpose processing units. In the old days you might have had one of these for floating-point operations; these days a common example is the graphics card
- Superscalar: duplicating some parts of the CPU
- Multicore: two or more CPUs located on the same chip
Classification of Parallelism
- Single Instruction, Multiple Data (SIMD)
- Multiple Instruction, Single Data (MISD)
- Multiple Instruction, Multiple Data (MIMD)
- Single Instruction, Single Data (SISD) - not parallel, just here to complete the set.
Shared vs Distributed Memory
Shared Memory
- single memory bus
- shared data
- here, the memory is the bottleneck. CPU cache is used to speed things up, but now we have the problem of cache coherence (keeping all the CPU caches in sync)
- a common low-level interface is POSIX Threads
- create / destroy threads
- mutex (mutual exclusion) locks
- semaphores
- barriers
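A rough sketch of those primitives, using Python's threading module rather than POSIX Threads itself (Lock plays the mutex, Barrier the barrier; the worker function and the count of 4 threads are made up for illustration):

```python
import threading

counter = 0
lock = threading.Lock()          # analogue of a pthread mutex
barrier = threading.Barrier(4)   # all 4 threads must arrive before any proceeds

def worker():
    global counter
    with lock:                   # mutual exclusion around the shared counter
        counter += 1
    barrier.wait()               # nobody continues until everyone has finished

# create the threads...
threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
# ...and reap them (the "destroy" half of create/destroy)
for t in threads:
    t.join()

print(counter)                   # 4
```

Semaphores exist too (threading.Semaphore) and work the same way as their POSIX cousins: acquire decrements, release increments, acquire blocks at zero.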
Distributed Memory
- each processor has its own separate memory
- typically, processors are linked using a network
- they communicate by passing messages
- here the message passing is the bottleneck
- a common low-level interface is MPI (Message Passing Interface)
- initialise a program with n processes (n is fixed throughout the program)
- send / recv data (both blocking and asynchronous versions)
- scatter / gather (to distribute and collect data from the distributed processors)
- reduce (do an operation accumulated across all the processors; for example, take a flag value from all processors and compute the logical AND)
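Those collective operations can be illustrated in miniature. This is not real MPI (no MPI library involved); it's plain Python that mimics the semantics of scatter, gather, and a logical-AND reduce over made-up per-process flags:

```python
from functools import reduce
import operator

def scatter(data, n):
    """Split data into n roughly equal chunks, one per 'process' (MPI_Scatter-style)."""
    k = len(data) // n
    return [data[i*k:(i+1)*k] if i < n - 1 else data[i*k:] for i in range(n)]

def gather(chunks):
    """Concatenate per-process chunks back together (MPI_Gather-style)."""
    return [x for chunk in chunks for x in chunk]

data = list(range(10))
chunks = scatter(data, 3)
# each "process" computes a local flag: are all of its values below 100?
flags = [all(v < 100 for v in chunk) for chunk in chunks]
done = reduce(operator.and_, flags)    # the logical-AND reduce from the notes
print(gather(chunks) == data, done)    # True True
```

In real MPI the chunks would live in separate address spaces and travel over the network; here they are just slices of one list, which is exactly the point of the shared-vs-distributed distinction above.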
After all that, was it even worth it?
- Amdahl's Law: there is a natural upper bound on how much faster your program will get, regardless of how many processors you throw at it.
- Gustafson's Law: given an infinitely large problem, a parallel implementation on n processors can attain a speedup approaching n.
- Speedup = time to run a sequential version / time to run a parallel version using p processors
- a simple measure telling you how much faster (or not) your program gets when you add more processors
- it often takes a while for the parallel version to even catch up with the sequential one
- Efficiency = Speedup / number of processors
- an indication of how well you are utilising the hardware
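A quick worked example of the speedup and efficiency formulas, plus the usual algebraic form of Amdahl's Law; the timings and the 10% serial fraction are invented for illustration:

```python
def speedup(t_seq, t_par):
    return t_seq / t_par

def efficiency(t_seq, t_par, p):
    return speedup(t_seq, t_par) / p

def amdahl_bound(serial_fraction, p):
    # Amdahl's Law: speedup <= 1 / (s + (1 - s) / p)
    # where s is the fraction of the program that must run sequentially
    return 1.0 / (serial_fraction + (1 - serial_fraction) / p)

# hypothetical timings: 120 s sequential, 20 s on 8 processors
print(speedup(120, 20))          # 6.0
print(efficiency(120, 20, 8))    # 0.75
# with 10% serial work, even a huge number of processors caps speedup near 10x
print(round(amdahl_bound(0.10, 1_000_000), 2))
```

The last line is Amdahl's pessimism in numbers: as p grows, the bound approaches 1/s, so 10% serial work means you can never do better than about 10x.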
Solving the array relaxation problem in parallel
- replace each element with the average of its 4 neighbouring elements.
- stop when old and new values are close enough together (i.e., when the difference is less than some threshold, say 0.01)
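Before the parallel versions, a minimal sequential sketch of the relaxation loop as described (the 0.01 threshold is from the notes; the grid values and helper names are mine):

```python
def relax(grid):
    """One relaxation step: each interior cell becomes the average of its
    4 neighbours; boundary cells are held fixed."""
    n, m = len(grid), len(grid[0])
    new = [row[:] for row in grid]
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            new[i][j] = (grid[i-1][j] + grid[i+1][j] +
                         grid[i][j-1] + grid[i][j+1]) / 4.0
    return new

def converged(a, b, threshold=0.01):
    """Old and new values close enough together everywhere?"""
    return all(abs(x - y) < threshold
               for ra, rb in zip(a, b) for x, y in zip(ra, rb))

grid = [[1.0, 1.0, 1.0],
        [1.0, 0.0, 1.0],
        [1.0, 1.0, 1.0]]
while True:
    new = relax(grid)
    if converged(grid, new):
        break
    grid = new
print(grid[1][1])   # 1.0 - the centre cell relaxes to its neighbours' average
```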
Shared memory version
- two arrays shared among k threads
- one array holding values from the last iteration, the other containing values computed during this iteration
- each thread works on m elements, so that all threads have approximately the same amount of work to do
- shared variable indicating how many threads have finished working (protected by a mutex)
- barrier is used to ensure all threads wait until everyone has finished the current round of calculations
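A sketch of that scheme, assuming Python threads in place of POSIX Threads (the grid, thread count, and helper names are made up; the barrier's action callback does the end-of-round bookkeeping, and a mutex protects the shared convergence measure, mirroring the mutex-protected counter in the notes):

```python
import threading

def parallel_relax(grid, k=2, threshold=0.01):
    n, m = len(grid), len(grid[0])
    state = {
        "old": [row[:] for row in grid],   # values from the last iteration
        "new": [row[:] for row in grid],   # values computed this iteration
        "maxdiff": 0.0,                    # shared, mutex-protected
        "done": False,
    }
    lock = threading.Lock()

    def end_of_round():
        # run exactly once per round, by one thread, while everyone waits
        state["done"] = state["maxdiff"] < threshold
        state["old"], state["new"] = state["new"], state["old"]
        state["maxdiff"] = 0.0

    barrier = threading.Barrier(k, action=end_of_round)

    def worker(tid):
        # static partition: thread tid handles interior rows tid+1, tid+1+k, ...
        rows = range(1 + tid, n - 1, k)
        while True:
            old, new = state["old"], state["new"]
            local = 0.0
            for i in rows:
                for j in range(1, m - 1):
                    v = (old[i-1][j] + old[i+1][j] +
                         old[i][j-1] + old[i][j+1]) / 4.0
                    local = max(local, abs(v - old[i][j]))
                    new[i][j] = v
            with lock:                    # mutex around the shared maximum
                state["maxdiff"] = max(state["maxdiff"], local)
            barrier.wait()                # everyone finishes the round together
            if state["done"]:
                return

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(k)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["old"]

grid = [[1.0] * 4,
        [1.0, 0.0, 0.0, 1.0],
        [1.0, 0.0, 0.0, 1.0],
        [1.0] * 4]
result = parallel_relax(grid)
print(result[1][1])   # relaxes towards 1.0
```

Note the swap inside the barrier action: because it runs while all threads are blocked, nobody can be reading "old" or writing "new" mid-swap, which is exactly the role the barrier plays in the scheme above. (In CPython the GIL means this won't actually run faster; it is the structure, not the performance, being sketched.)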
Results
- we get a slow-down for 2 threads, then speedup is observed between 3 and 16 threads
- drops off for > 16 threads, because now we have more running threads than the hardware provides, so they have to be scheduled (overhead!)
Distributed memory
- MPI needs contiguous blocks of memory to share data, so arrays are implemented in the following way:
- one long contiguous (n × k) array to store the data
- an array of n pointers to each “row” in the underlying array.
- use Scatter to spread the data among the threads
- each thread does its work
- use send/recv to send the top/bottom row to neighbouring threads
- use allreduce with logical AND to figure out if we’re all done
- an alternative way I tried was to gather the data at the end of each calculation step, work out whether we're done in the main thread, then redistribute it. It was waaaay slower because of all the sequential work, and all the communication overhead.
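The contiguous-block layout described above (one long array plus an array of row pointers) can be mimicked with index arithmetic; note this is only an illustration, since in C the row pointers alias the block whereas Python slices make copies:

```python
n, k = 3, 4                # grid dimensions (made up for illustration)
flat = [0.0] * (n * k)     # one contiguous block, row-major, as MPI wants

def cell(r, c):
    # what the array of row pointers buys you in C: row[r][c] == flat[r*k + c]
    return r * k + c

flat[cell(1, 2)] = 7.0
rows = [flat[r * k:(r + 1) * k] for r in range(n)]  # aliases in C; copies here
print(rows[1][2])   # 7.0
```

Because the data is one block, a single Scatter call can carve it into per-process slices with no per-row copying.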
Results
- send/recv version: good speedup, with some superlinear speedup near the beginning, maybe because of a poor reference (sequential) implementation(??)
- scatter/gather version: still a good speedup, but dropping off after 16 threads. Most likely because each processor has 16 cores, so beyond that we have to use the slowwwwww network as opposed to faster RAM for communication.