[Spotlight] Parallel computing - SIMD with shared and distributed memory architectures

Just an idea for a spotlight this evening :slight_smile:

Parallel Computing

Parallelism in the hardware

  • Bit-wise parallelism; for example, parallel adders
  • Pipelines: overlap the stages of the fetch/decode/execute cycle
  • Coprocessors: special-purpose processing units. E.g., in the old days you might have had one of these for floating-point
    operations; these days a common example is the graphics card
  • Superscalar: duplicating some parts of the CPU (e.g. functional units) so that more than one instruction can be issued
    per cycle
  • Multicore: two or more CPUs located on the same chip

Classification of Parallelism

  • Single Instruction Multiple Data (SIMD)
  • Multiple Instruction Single Data (MISD)
  • Multiple Instruction Multiple Data (MIMD)
  • Single Instruction Single Data (SISD) - not parallel, just here to complete the set.

Shared vs Distributed Memory

Shared Memory

  • single memory bus
  • shared data
  • here, the memory is the bottleneck. CPU cache is used to speed things up, but now we have the problem of cache
    coherence (keeping all the CPU caches in sync)
  • a common low-level interface is POSIX Threads (see the sketch after this list)
    • create / destroy threads
    • mutex (mutual exclusion) locks
    • semaphores
    • barriers
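
The same primitives exist in most languages; as a rough analogue (not pthreads itself), here is a minimal sketch of the thread / mutex / barrier pattern using Python's threading module:

```python
import threading

N_THREADS = 4
counter = 0
counter_lock = threading.Lock()          # mutex protecting the shared counter
barrier = threading.Barrier(N_THREADS)   # every thread waits here at the end of a round
# (threading.Semaphore covers the semaphore primitive in the same way)

def worker(tid):
    global counter
    # ... do this thread's share of the work ...
    with counter_lock:                   # mutual exclusion around shared state
        counter += 1
    barrier.wait()                       # nobody continues until everyone has arrived

threads = [threading.Thread(target=worker, args=(i,)) for i in range(N_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()                             # wait for (and clean up) the threads
```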

Distributed Memory

  • each processor hos it’s own separate memory
  • typically, processors are linked using a network
  • they communicate by passing messages
  • here the message passing is the bottleneck
  • a common low-level interface is MPI (Message Passing Interface); see the sketch after this list
    • initialise a program with n processes (n is fixed throughout the program)
    • send / recv data (both blocking and asynchronous versions)
    • scatter / gather (to distribute data to, and collect data from, the distributed processes)
    • reduce (accumulate an operation across all the processes; for example, take a flag value from every process and
      compute the logical AND)
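
For reference, here is roughly what those calls look like through mpi4py (assuming mpi4py is available; run with at least two processes, e.g. mpiexec -n 4 python demo.py):

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()          # this process's id, 0 .. n-1
size = comm.Get_size()          # n, fixed for the whole run

# send / recv (blocking; isend / irecv are the asynchronous versions)
if rank == 0:
    comm.send({"hello": 42}, dest=1, tag=0)
elif rank == 1:
    data = comm.recv(source=0, tag=0)

# scatter / gather: the root splits a list across processes and collects results back
chunk = comm.scatter([[i, i + 1] for i in range(size)] if rank == 0 else None, root=0)
sums = comm.gather(sum(chunk), root=0)

# reduce: combine one value from every process, e.g. logical AND of a "done" flag
all_done = comm.allreduce(True, op=MPI.LAND)
```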

After all that, was it even worth it?

  • Amdahl’s Law: there is a natural upper bound on how much faster your program will get, regardless of how many
    processors you throw at it. If a fraction s of the program is inherently sequential, the speedup on p processors is at
    most 1 / (s + (1 - s) / p), which approaches 1 / s as p grows.
  • Gustafson’s Law: given a large enough problem, a parallel implementation on n processors can attain a speedup
    approaching n.
  • Speedup = time to run a sequential version / time to run a parallel version using p processors
    • a simple measure telling you how much faster (or not) your program gets when you add more processors
    • it often takes a while for the parallel version to even catch up with the sequential one
  • Efficiency = Speedup / number of processors
    • an indication of how well you are utilising the hardware (a small illustrative helper for both metrics follows this
      list)
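
Purely illustrative, here is how those two metrics (plus Amdahl’s predicted ceiling) might look as code; the example numbers are made up, apart from the 37 s -> 9 s figure that comes up later in the thread:

```python
def speedup(t_seq, t_par):
    """Speedup = sequential run time / parallel run time on p processors."""
    return t_seq / t_par

def efficiency(t_seq, t_par, p):
    """Efficiency = speedup / number of processors (1.0 = perfect utilisation)."""
    return speedup(t_seq, t_par) / p

def amdahl_bound(s, p):
    """Amdahl's law: maximum speedup on p processors if a fraction s is sequential."""
    return 1.0 / (s + (1.0 - s) / p)

print(speedup(37.0, 9.0), efficiency(37.0, 9.0, 4))  # ~4.1x and ~1.03 with 4 threads
print(amdahl_bound(0.05, 16))                        # ~9.1x even on 16 processors, if 5% is sequential
```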

Solving the array relaxation problem in parallel

  • find the average of the 4 neighbouring elements.
  • stop when the old and new values are close enough together (i.e., when the difference is less than some threshold, say
    0.01); a small sequential sketch of this follows below
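
For context, a tiny sequential sketch of that relaxation loop (my own illustration, assuming a NumPy grid with a fixed boundary, not the actual coursework code):

```python
import numpy as np

def relax(grid, threshold=0.01):
    """Repeatedly replace each interior cell with the mean of its 4 neighbours."""
    old = grid.astype(float)
    while True:
        new = old.copy()
        new[1:-1, 1:-1] = (old[:-2, 1:-1] + old[2:, 1:-1] +
                           old[1:-1, :-2] + old[1:-1, 2:]) / 4.0
        if np.max(np.abs(new - old)) < threshold:   # stop when values barely change
            return new
        old = new

grid = np.pad(np.zeros((6, 6)), 1, constant_values=1.0)  # hot boundary, cold interior
print(relax(grid))
```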

Shared memory version

  • two arrays shared among k threads
  • one array holding values from the last iteration, the other containing the values computed during this iteration
  • each thread works on m elements, so that all threads have approximately the same amount of work to do
  • a shared variable indicates how many threads have finished working (protected by a mutex)
  • a barrier is used to ensure all threads wait until everyone has finished the current round of calculations (a rough
    sketch of this structure follows this list)
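
A minimal sketch of that structure using Python threads (the real thing would be C + pthreads, and Python's GIL means this particular version wouldn't actually run faster, but the two-array / mutex / barrier pattern is the same):

```python
import threading
import numpy as np

K = 4                                   # number of worker threads
THRESHOLD = 0.01
old = np.pad(np.zeros((64, 64)), 1, constant_values=1.0)   # fixed hot boundary
new = old.copy()
barrier = threading.Barrier(K)
done_lock = threading.Lock()
done_count = 0                          # shared: how many threads converged this round
global_done = False

def worker(tid):
    global old, new, done_count, global_done
    while True:
        local_done = True
        for i in range(1 + tid, old.shape[0] - 1, K):       # interleaved rows, ~equal work
            for j in range(1, old.shape[1] - 1):
                new[i, j] = (old[i-1, j] + old[i+1, j] +
                             old[i, j-1] + old[i, j+1]) / 4.0
                if abs(new[i, j] - old[i, j]) >= THRESHOLD:
                    local_done = False
        with done_lock:                 # mutex-protected shared counter
            if local_done:
                done_count += 1
        barrier.wait()                  # wait until everyone has finished this round
        if tid == 0:                    # one thread decides, resets, and swaps the arrays
            global_done = done_count == K
            done_count = 0
            old, new = new, old
        barrier.wait()                  # everyone sees the decision and the swap
        if global_done:
            return

threads = [threading.Thread(target=worker, args=(i,)) for i in range(K)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```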

Results

  • we get a slow-down for 2 threads, then speedup is observed between 3 and 16 threads
  • drops off for > 16 threads, because now we have more running threads than hardware cores, so they have to be scheduled
    (overhead!)

Distributed memory

  • MPI needs contiguous blocks of memory to share data, so the arrays are implemented in the following way:
    • one long contiguous (n × k) array to store the data
    • an array of n pointers, one to each “row” in the underlying array.
  • use Scatter to spread the data among the processes
  • each process does its work
  • use send/recv to exchange the top/bottom rows with neighbouring processes
  • use allreduce with logical AND to figure out if we’re all done (a rough sketch of this structure follows this list)
  • an alternative I tried was to gather the data at the end of each calculation step, work out in the root process whether
    we’re done, then redistribute it. It was waaaay slower because of all the extra sequential work, and all the
    communication overhead.
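
A rough mpi4py sketch of that send/recv + allreduce structure (row decomposition with halo rows; my own illustration rather than the coursework code, and it assumes the number of rows divides evenly among the processes):

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
THRESHOLD = 0.01
N = 64                                              # assume size divides N exactly

full = None
if rank == 0:                                       # root holds the whole grid as one contiguous block
    full = np.zeros((N, N))
    full[0, :] = full[-1, :] = full[:, 0] = full[:, -1] = 1.0   # fixed hot boundary

local = np.empty((N // size, N))
comm.Scatter(full, local, root=0)                   # distribute row blocks

chunk = np.vstack([np.zeros((1, N)), local, np.zeros((1, N))])  # add halo rows top and bottom
up = rank - 1 if rank > 0 else MPI.PROC_NULL
down = rank + 1 if rank < size - 1 else MPI.PROC_NULL

done = False
while not done:
    # exchange boundary rows with the neighbouring processes
    comm.Sendrecv(chunk[1], dest=up, recvbuf=chunk[-1], source=down)
    comm.Sendrecv(chunk[-2], dest=down, recvbuf=chunk[0], source=up)

    new = chunk.copy()
    new[1:-1, 1:-1] = (chunk[:-2, 1:-1] + chunk[2:, 1:-1] +
                       chunk[1:-1, :-2] + chunk[1:-1, 2:]) / 4.0
    if rank == 0:
        new[1] = chunk[1]                           # keep the global top boundary row fixed
    if rank == size - 1:
        new[-2] = chunk[-2]                         # keep the global bottom boundary row fixed

    local_done = bool(np.max(np.abs(new[1:-1] - chunk[1:-1])) < THRESHOLD)
    done = comm.allreduce(local_done, op=MPI.LAND)  # everyone must agree we're finished
    chunk = new

comm.Gather(np.ascontiguousarray(chunk[1:-1]), full, root=0)    # collect the result on the root
```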

Results

  • send/recv version: good speedup, with some superlinear speedup at small numbers of processes, maybe because of a weak
    reference (sequential) implementation(?)
  • scatter/gather version: still a good speedup, but dropping off after 16 processes, most likely because each node has a
    16-core processor, so beyond that we have to use the slowwwwww network as opposed to faster RAM for communication.

Excellent! Lots of material there. And for me, of course, it prompts loads of thoughts about computer history… I could do a spotlight on the intersection there.

Thanks!

Yes please, a spotlight on computer history would be really interesting! :slight_smile:

Kinda relevant, but I don’t think it’s as low-level as you guys are interested in:

Does anyone have any good resources for parallel processes in python? I’ve done some googling and can’t quite get my head round it.

I’m trying to query data using a list and then do something with all the data as fast as possible. The querying is the bottleneck and I’m not hitting network limits yet. I’m thinking I can query using each item of the list in parallel, then sort the results (if sorting is even needed) and then do whatever with the data.
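
Roughly what I have in mind, as a sketch (query(item) is just a stand-in name for whatever request I’m making per item):

```python
from concurrent.futures import ThreadPoolExecutor

def query(item):
    # placeholder for the real network request for one item
    ...
    return item

items = ["a", "b", "c", "d"]

# threads are fine here despite the GIL, because each one mostly waits on the network
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(query, items))   # results come back in input order

# now combine / process the results in one go
```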

Interesting question: I had a quick look and pymp looked interesting. (Via the python wiki’s Parallel Processing page.)

I’ve a feeling one of python’s features, or limitations, is the Global Interpreter Lock, which limits parallelisation of CPU-bound work with threads (I/O-bound work can still overlap).


I’ve not done much in that area in Python, other than preprocessing/cleaning data files & I’m pretty sure I leant on some ML Frameworks for some of that (and the datasets weren’t mega-huge).

I can’t vouch for these, but they’re focused on parallel processing of list/array objects:

Dask looks interesting & powerful (apply a custom function to each element, or they have APIs matching numpy, pandas etc., with distributed scheduling available)
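
For what it’s worth, the dask.bag version of “apply a function to each element in parallel” looks roughly like this (just a sketch of the API, not something I’ve benchmarked):

```python
import dask.bag as db

def process(x):
    return x * x          # stand-in for your per-element function

bag = db.from_sequence(range(10_000), npartitions=8)              # split the list into 8 chunks
result = bag.map(process).filter(lambda x: x % 2 == 0).compute()  # partitions run in parallel
```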

And Parallel Collections looks pretty good for parallel list processing using [map, reduce, filter] (multi-core, but AFAICT single machine)


Thanks Ed! I’ve gone from 37s on a single thread to 9s with 4 threads so pretty good so far!

Pretty easy to implement in my case and made much more sense than the pool examples I was looking at. I’m using Google Colab while at work and it sometimes fails with connection terminated. Will try on my own hardware later and increase the threads.

Edit: Now I’m hitting the limit on server requests! Don’t think I’m going to get much more out of it. Need to get creative on making sure no data goes missing when a request fails.


I’ve put some thoughts together but have a filthy rotten headache so I don’t think I can deliver a talk today :face_with_head_bandage: :anguished:

Great news!

Nice (but long!) technical post by Russ Cox here about the semantics of shared memory in multi-core systems. Not only is x86 a bit special (compared to ARM for example), but sometimes a particular CPU implementation will be deceptively helpful:

…hardware implementing a stronger model than technically guaranteed encourages dependence on the stronger behavior and means that future, weaker hardware will break programs, validly or not.

I’m purposely leaving out a lot of subtleties in the ARM and POWER weak memory models. For more detail, see any of Peter Sewell’s papers on the topic. There are two important points to take away. First, there is an incredible amount of subtlety here, the subject of well over a decade of academic research by very persistent, very smart people. I don’t claim to understand anywhere near all of it myself. This is not something we should hope to explain to ordinary programmers, not something that we can hope to keep straight while debugging ordinary programs. Second, the gap between what is allowed and what is observed makes for unfortunate future surprises.

As we will see in the next post, the question of a memory model for a higher-level programming language does not have as neat and tidy an answer.