[Spotlight] Parallel computing - SIMD with shared and distributed memory architectures

Just an idea for a spotlight this evening :slight_smile:

Parallel Computing

Parallelism in the hardware

  • Bit-wise parallelism; for example, parallel adders
  • Pipelines: overlap the stages of the fetch/decode/execute cycle
  • Coprocessors: special-purpose processing units. E.g., in the old days you might have had one of these for floating-point
    operations; these days a common example is the graphics card
  • Superscalar: duplicating some parts of the CPU (e.g. functional units) so that more than one instruction can be issued
    per cycle
  • Multicore: two or more CPUs located on the same chip

Classification of Parallelism

  • Single Instruction Multiple Data (SIMD)
  • Multiple Instruction Single Data (MISD)
  • Multiple Instruction Multiple Data (MIMD)
  • Single Instruction Single Data (SISD) - not parallel, just here to complete the set.

Shared vs Distributed Memory

Shared Memory

  • single memory bus
  • shared data
  • here, the memory is the bottleneck. CPU cache is used to speed things up, but now we have the problem of cache
    coherence (keeping all the CPU caches in sync)
  • a common low-level interface is POSIX Threads (see the sketch after this list)
    • create / destroy threads
    • mutex (mutual exclusion) locks
    • semaphores
    • barriers
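
The same primitives exist in most languages; as a rough analogue (not pthreads itself), here is a minimal sketch of the thread / mutex / barrier pattern using Python's threading module:

```python
import threading

N_THREADS = 4
counter = 0
counter_lock = threading.Lock()          # mutex protecting the shared counter
barrier = threading.Barrier(N_THREADS)   # every thread waits here at the end of a round
# (threading.Semaphore covers the semaphore primitive in the same way)

def worker(tid):
    global counter
    # ... do this thread's share of the work ...
    with counter_lock:                   # mutual exclusion around shared state
        counter += 1
    barrier.wait()                       # nobody continues until everyone has arrived

threads = [threading.Thread(target=worker, args=(i,)) for i in range(N_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()                             # wait for (and clean up) the threads
```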

Distributed Memory

  • each processor hos it’s own separate memory
  • typically, processors are linked using a network
  • they communicate by passing messages
  • here the message passing is the bottleneck
  • a common low-level interface is MPI (Message Passing Interface); see the sketch after this list
    • initialise a program with n processes (n is fixed throughout the program)
    • send / recv data (both blocking and asynchronous versions)
    • scatter / gather (to distribute data to, and collect data from, the distributed processes)
    • reduce (accumulate an operation across all the processes; for example, take a flag value from every process and
      compute the logical AND)
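
For reference, here is roughly what those calls look like through mpi4py (assuming mpi4py is available; run with at least two processes, e.g. mpiexec -n 4 python demo.py):

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()          # this process's id, 0 .. n-1
size = comm.Get_size()          # n, fixed for the whole run

# send / recv (blocking; isend / irecv are the asynchronous versions)
if rank == 0:
    comm.send({"hello": 42}, dest=1, tag=0)
elif rank == 1:
    data = comm.recv(source=0, tag=0)

# scatter / gather: the root splits a list across processes and collects results back
chunk = comm.scatter([[i, i + 1] for i in range(size)] if rank == 0 else None, root=0)
sums = comm.gather(sum(chunk), root=0)

# reduce: combine one value from every process, e.g. logical AND of a "done" flag
all_done = comm.allreduce(True, op=MPI.LAND)
```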

After all that, was it even worth it?

  • Amdahl’s Law: there is a natural upper bound on how much faster your program will get, regardless of how many
    processors you throw at it. If a fraction s of the program is inherently sequential, the speedup on p processors is at
    most 1 / (s + (1 - s) / p), which approaches 1 / s as p grows.
  • Gustafson’s Law: given a large enough problem, a parallel implementation on n processors can attain a speedup
    approaching n.
  • Speedup = time to run a sequential version / time to run a parallel version using p processors
    • a simple measure telling you how much faster (or not) your program gets when you add more processors
    • it often takes a while for the parallel version to even catch up with the sequential one
  • Efficiency = Speedup / number of processors
    • an indication of how well you are utilising the hardware (a small illustrative helper for both metrics follows this
      list)
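
Purely illustrative, here is how those two metrics (plus Amdahl’s predicted ceiling) might look as code; the example numbers are made up, apart from the 37 s -> 9 s figure that comes up later in the thread:

```python
def speedup(t_seq, t_par):
    """Speedup = sequential run time / parallel run time on p processors."""
    return t_seq / t_par

def efficiency(t_seq, t_par, p):
    """Efficiency = speedup / number of processors (1.0 = perfect utilisation)."""
    return speedup(t_seq, t_par) / p

def amdahl_bound(s, p):
    """Amdahl's law: maximum speedup on p processors if a fraction s is sequential."""
    return 1.0 / (s + (1.0 - s) / p)

print(speedup(37.0, 9.0), efficiency(37.0, 9.0, 4))  # ~4.1x and ~1.03 with 4 threads
print(amdahl_bound(0.05, 16))                        # ~9.1x even on 16 processors, if 5% is sequential
```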

Solving the array relaxation problem in parallel

  • find the average of the 4 neighbouring elements.
  • stop when the old and new values are close enough together (i.e., when the difference is less than some threshold, say
    0.01); a small sequential sketch of this follows below
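
For context, a tiny sequential sketch of that relaxation loop (my own illustration, assuming a NumPy grid with a fixed boundary, not the actual coursework code):

```python
import numpy as np

def relax(grid, threshold=0.01):
    """Repeatedly replace each interior cell with the mean of its 4 neighbours."""
    old = grid.astype(float)
    while True:
        new = old.copy()
        new[1:-1, 1:-1] = (old[:-2, 1:-1] + old[2:, 1:-1] +
                           old[1:-1, :-2] + old[1:-1, 2:]) / 4.0
        if np.max(np.abs(new - old)) < threshold:   # stop when values barely change
            return new
        old = new

grid = np.pad(np.zeros((6, 6)), 1, constant_values=1.0)  # hot boundary, cold interior
print(relax(grid))
```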

Shared memory version

  • two arrays shared among k threads
  • one array holding values from the last iteration, the other containing the values computed during this iteration
  • each thread works on m elements, so that all threads have approximately the same amount of work to do
  • a shared variable indicates how many threads have finished working (protected by a mutex)
  • a barrier is used to ensure all threads wait until everyone has finished the current round of calculations (a rough
    sketch of this structure follows this list)
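
A minimal sketch of that structure using Python threads (the real thing would be C + pthreads, and Python's GIL means this particular version wouldn't actually run faster, but the two-array / mutex / barrier pattern is the same):

```python
import threading
import numpy as np

K = 4                                   # number of worker threads
THRESHOLD = 0.01
old = np.pad(np.zeros((64, 64)), 1, constant_values=1.0)   # fixed hot boundary
new = old.copy()
barrier = threading.Barrier(K)
done_lock = threading.Lock()
done_count = 0                          # shared: how many threads converged this round
global_done = False

def worker(tid):
    global old, new, done_count, global_done
    while True:
        local_done = True
        for i in range(1 + tid, old.shape[0] - 1, K):       # interleaved rows, ~equal work
            for j in range(1, old.shape[1] - 1):
                new[i, j] = (old[i-1, j] + old[i+1, j] +
                             old[i, j-1] + old[i, j+1]) / 4.0
                if abs(new[i, j] - old[i, j]) >= THRESHOLD:
                    local_done = False
        with done_lock:                 # mutex-protected shared counter
            if local_done:
                done_count += 1
        barrier.wait()                  # wait until everyone has finished this round
        if tid == 0:                    # one thread decides, resets, and swaps the arrays
            global_done = done_count == K
            done_count = 0
            old, new = new, old
        barrier.wait()                  # everyone sees the decision and the swap
        if global_done:
            return

threads = [threading.Thread(target=worker, args=(i,)) for i in range(K)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```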

Results

  • we get a slow-down for 2 threads, then speedup is observed between 3 and 16 threads
  • drops off for > 16 threads, because now we have more running threads than hardware cores, so they have to be scheduled
    (overhead!)

Distributed memory

  • MPI needs contiguous blocks of memory to share data, so the arrays are implemented in the following way:
    • one long contiguous (n × k) array to store the data
    • an array of n pointers, one to each “row” in the underlying array.
  • use Scatter to spread the data among the processes
  • each process does its work
  • use send/recv to exchange the top/bottom rows with neighbouring processes
  • use allreduce with logical AND to figure out if we’re all done (a rough sketch of this structure follows this list)
  • an alternative I tried was to gather the data at the end of each calculation step, work out in the root process whether
    we’re done, then redistribute it. It was waaaay slower because of all the extra sequential work, and all the
    communication overhead.
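
A rough mpi4py sketch of that send/recv + allreduce structure (row decomposition with halo rows; my own illustration rather than the coursework code, and it assumes the number of rows divides evenly among the processes):

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
THRESHOLD = 0.01
N = 64                                              # assume size divides N exactly

full = None
if rank == 0:                                       # root holds the whole grid as one contiguous block
    full = np.zeros((N, N))
    full[0, :] = full[-1, :] = full[:, 0] = full[:, -1] = 1.0   # fixed hot boundary

local = np.empty((N // size, N))
comm.Scatter(full, local, root=0)                   # distribute row blocks

chunk = np.vstack([np.zeros((1, N)), local, np.zeros((1, N))])  # add halo rows top and bottom
up = rank - 1 if rank > 0 else MPI.PROC_NULL
down = rank + 1 if rank < size - 1 else MPI.PROC_NULL

done = False
while not done:
    # exchange boundary rows with the neighbouring processes
    comm.Sendrecv(chunk[1], dest=up, recvbuf=chunk[-1], source=down)
    comm.Sendrecv(chunk[-2], dest=down, recvbuf=chunk[0], source=up)

    new = chunk.copy()
    new[1:-1, 1:-1] = (chunk[:-2, 1:-1] + chunk[2:, 1:-1] +
                       chunk[1:-1, :-2] + chunk[1:-1, 2:]) / 4.0
    if rank == 0:
        new[1] = chunk[1]                           # keep the global top boundary row fixed
    if rank == size - 1:
        new[-2] = chunk[-2]                         # keep the global bottom boundary row fixed

    local_done = bool(np.max(np.abs(new[1:-1] - chunk[1:-1])) < THRESHOLD)
    done = comm.allreduce(local_done, op=MPI.LAND)  # everyone must agree we're finished
    chunk = new

comm.Gather(np.ascontiguousarray(chunk[1:-1]), full, root=0)    # collect the result on the root
```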

Results

  • send/recv version: good speedup, with some superlinear speedup at small numbers of processes, maybe because of a weak
    reference (sequential) implementation(?)
  • scatter/gather version: still a good speedup, but dropping off after 16 processes, most likely because each node has a
    16-core processor, so beyond that we have to use the slowwwwww network as opposed to faster RAM for communication.

Excellent! Lots of material there. And for me, of course, it prompts loads of thoughts about computer history… I could do a spotlight on the intersection there.

Thanks!

Yes please, a spotlight on computer history would be really interesting! :slight_smile:

Kinda relevant, but I don’t think it’s as low-level as you guys are interested in:

Does anyone have any good resources for parallel processes in python? I’ve done some googling and can’t quite get my head round it.

I’m trying to query data using a list and then do something with all the data as fast as possible. The querying is the bottleneck and I’m not hitting network limits yet. I’m thinking I can query using each item of the list in parallel, then sort the results (if sorting is even needed) and then do whatever with the data.
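
Roughly what I have in mind, as a sketch (query(item) is just a stand-in name for whatever request I’m making per item):

```python
from concurrent.futures import ThreadPoolExecutor

def query(item):
    # placeholder for the real network request for one item
    ...
    return item

items = ["a", "b", "c", "d"]

# threads are fine here despite the GIL, because each one mostly waits on the network
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(query, items))   # results come back in input order

# now combine / process the results in one go
```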

Interesting question: I had a quick look and pymp looked interesting. (Via the python wiki’s Parallel Processing page.)

I’ve a feeling one of python’s features, or limitations, is the Global Interpreter Lock, which limits parallelisation of CPU-bound work with threads (I/O-bound work can still overlap).


I’ve not done much in that area in Python, other than preprocessing/cleaning data files & I’m pretty sure I leant on some ML Frameworks for some of that (and the datasets weren’t mega-huge).

I can’t vouch for these, but they’re focused on parallel processing of list/array objects:

Dask looks interesting & powerful (apply a custom function to each element, or they have APIs matching numpy, pandas etc., with distributed scheduling available)
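
For what it’s worth, the dask.bag version of “apply a function to each element in parallel” looks roughly like this (just a sketch of the API, not something I’ve benchmarked):

```python
import dask.bag as db

def process(x):
    return x * x          # stand-in for your per-element function

bag = db.from_sequence(range(10_000), npartitions=8)              # split the list into 8 chunks
result = bag.map(process).filter(lambda x: x % 2 == 0).compute()  # partitions run in parallel
```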

And Parallel Collections looks pretty good for parallel list processing using [map, reduce, filter] (multi-core, but AFAICT single machine)


Thanks Ed! I’ve gone from 37s on a single thread to 9s with 4 threads so pretty good so far!

Pretty easy to implement in my case and made much more sense than the pool examples I was looking at. I’m using Google Colab while at work and it sometimes fails with connection terminated. Will try on my own hardware later and increase the threads.

Edit: Now I’m hitting the limit on server requests! Don’t think I’m going to get much more out of it. Need to get creative on making sure no data goes missing when a request fails.


I’ve put some thoughts together but have a filthy rotten headache so I don’t think I can deliver a talk today :face_with_head_bandage: :anguished:

Great news!

Nice (but long!) technical post by Russ Cox here about the semantics of shared memory in multi-core systems. Not only is x86 a bit special (compared to ARM for example), but sometimes a particular CPU implementation will be deceptively helpful:

…hardware implementing a stronger model than technically guaranteed encourages dependence on the stronger behavior and means that future, weaker hardware will break programs, validly or not.

I’m purposely leaving out a lot of subtleties in the ARM and POWER weak memory models. For more detail, see any of Peter Sewell’s papers on the topic. There are two important points to take away. First, there is an incredible amount of subtlety here, the subject of well over a decade of academic research by very persistent, very smart people. I don’t claim to understand anywhere near all of it myself. This is not something we should hope to explain to ordinary programmers, not something that we can hope to keep straight while debugging ordinary programs. Second, the gap between what is allowed and what is observed makes for unfortunate future surprises.

As we will see in the next post, the question of a memory model for a higher-level programming language does not have as neat and tidy an answer.