  1. Multithreading

  2. Architectural State and Context Switches

  3. Architectural State
     • The "Architectural State" of a thread is everything that defines the state of a running program (a sketch follows this slide):
       • The contents of the register file
       • The current program counter
       • The current contents of memory (its "address space")
     • Note that all of these are well-defined because the semantics of the ISA dictate that instructions execute one at a time.
     • The architectural state of a processor includes this and other privileged state.
     • A thread is running on a processor if the processor's register file, PC, and current contents of memory are those of the thread.
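  The per-thread portion of this state maps naturally onto a plain struct. This is a hypothetical sketch, assuming a 64-bit machine with 32 general-purpose registers; the names are illustrative, not any particular OS's definition:

      #include <stdint.h>

      /* Hypothetical sketch of per-thread architectural state. */
      struct arch_state {
          uint64_t regs[32];      /* contents of the register file */
          uint64_t pc;            /* current program counter */
          void    *address_space; /* handle to the memory contents, e.g. a page-table root */
      };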

  4. Context Switches
     • Switch out
       • Save the current register state
       • Save the current PC
     • Switch in
       • Install new register state
       • Install new address space (sometimes)
       • Jump to new PC
     (A runnable userspace illustration follows this slide.)
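  The switch-out/switch-in sequence can be demonstrated in userspace with the POSIX ucontext API: swapcontext() saves the current registers and PC and installs another context's. A minimal runnable sketch (function and variable names are illustrative; no address-space change happens here, since both contexts share one):

      #include <stdio.h>
      #include <ucontext.h>

      static ucontext_t main_ctx, other_ctx;
      static char other_stack[64 * 1024];

      static void other(void) {
          puts("switched in: new registers, new PC");
          swapcontext(&other_ctx, &main_ctx);  /* switch back out */
      }

      int main(void) {
          getcontext(&other_ctx);                    /* template for the new context */
          other_ctx.uc_stack.ss_sp   = other_stack;
          other_ctx.uc_stack.ss_size = sizeof other_stack;
          other_ctx.uc_link          = &main_ctx;
          makecontext(&other_ctx, other, 0);
          swapcontext(&main_ctx, &other_ctx);        /* save out, install in, jump */
          puts("switched back to main");
          return 0;
      }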

  5. Kinds of Context Switches
     • Userspace context switch
       • This is really just another form of control transfer
       • Registers and PC only
       • Userspace threads share an address space, so no need to change it
       • Examples: setjmp()/longjmp() (sketched after this slide), userspace threading packages
       • ~1 µs
     • Thread switch
       • Happens in the kernel; each thread is a "lightweight process"
       • Registers and PC only
       • Threads in the same process share an address space, so no need to change it
       • ~3 µs
     • Process switch
       • Happens in the kernel
       • Registers, PC, and address space
       • A bunch of other stuff too: file descriptors, etc.
       • ~3 µs
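  Since the slide cites setjmp()/longjmp() as the userspace example, here is a minimal sketch of that form of control transfer: setjmp() checkpoints the callee-saved registers and PC, and longjmp() reinstates them (valid only while the frame that called setjmp() is still live):

      #include <setjmp.h>
      #include <stdio.h>

      static jmp_buf checkpoint;

      static void worker(void) {
          puts("in worker; jumping back");
          longjmp(checkpoint, 1);          /* restore the saved registers and PC */
      }

      int main(void) {
          if (setjmp(checkpoint) == 0) {   /* save registers and PC; returns 0 the first time */
              worker();
          } else {
              puts("back in main after longjmp");
          }
          return 0;
      }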

  6. Hidden Costs of Context Switches
     • The latency for registers, PC, and memory is unavoidable and relatively easy to measure
     • But the cost of "warming" the caches and predictors is non-trivial [Choi '08] [Brown '10]

  7. Parallelism's Granularity and Abundance

  8. Communication Between Threads is Expensive
     • Context switches are expensive (see the previous slides)
     • Coherence is expensive: on the order of accessing main memory
     • Communication is required to exploit parallelism: each "fork" and "join" point in the dependence graph requires communication.

  9. Parallelism Granularity (ILP)

      for (i = 1; i <= 5; i++) {
          s[i] = a[i] + b[i] + c[i] + ...;
      }

      Finest (1 inst); Fine (several inst)

  10. Parallelism Granularity (TLP)

      Par_msum(Matrix A, Matrix B) {
          Matrix R;
          R[upleft]   = Spawn(msum(A[upleft],   B[upleft]));
          R[upright]  = Spawn(msum(A[upright],  B[upright]));
          R[lowleft]  = Spawn(msum(A[lowleft],  B[lowleft]));
          R[lowright] = Spawn(msum(A[lowright], B[lowright]));
          Wait(the_barrier);
      }

      Coarse (1000s of inst); Coarsest (millions of inst)

  11. Parallelism Granularity
      • All kinds of parallelism are (in general) hard for people and computers to find.
      • We would like to exploit parallelism wherever it is.
      • Fine-grain: modern processors are pretty good at ILP.
        • An infinitely large instruction window could exploit all available parallelism.
        • Instruction window size limits the scope over which they can find ILP.
      • Coarse-grain: pthreads, fork(), and OpenMP are OK options (see the sketch after this slide).
        • Context switch costs place a lower bound on the grain size that is profitable to exploit.
      • So, we would like to lower the cost of synchronization/communication.
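  A minimal coarse-grain sketch using pthreads (the thread count and function names are illustrative): each pthread_create() is a fork point and each pthread_join() a join point, so the per-thread work must be large enough to amortize their cost. Compile with -pthread.

      #include <pthread.h>
      #include <stdio.h>

      #define NTHREADS 4

      static void *work(void *arg) {
          printf("thread %ld: do a coarse-grain chunk of work here\n", (long)arg);
          return NULL;
      }

      int main(void) {
          pthread_t t[NTHREADS];
          for (long i = 0; i < NTHREADS; i++)
              pthread_create(&t[i], NULL, work, (void *)i);  /* fork point */
          for (long i = 0; i < NTHREADS; i++)
              pthread_join(t[i], NULL);                      /* join point */
          return 0;
      }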

  12. Apps vs. Cores
      • App variation
        • Some apps have lots of ILP (floating point)
        • Some apps have very little ILP (gcc), as does any app with very poor cache performance
        • An "average" thread has an IPC of about 0.5-2
        • Some apps have lots of threads (Apache); some have just one (most apps)
      • Core variation
        • One big, expensive, power-hungry core is good at ILP
        • Many small cores are better at TLP
      • How can we have the best of both worlds?
        • Use big(ger) cores and convert TLP into ILP

  13. Motivation
      [Figure: stacked bars showing the percent of total issue cycles spent on each cause: processor busy, itlb miss, dtlb miss, icache miss, dcache miss, branch misprediction, control hazards, load delays, short int, long int, short fp, long fp, and memory conflicts.]

  14. Hardware Multithreading
      [Figure: a conventional instruction stream feeds a CPU with a single PC and register set; a multithreaded instruction stream feeds a CPU with multiple PCs and register sets, one per hardware thread.]

  15. Superscalar Execution
      [Figure: a grid of issue slots (columns) over time in processor cycles (rows).]

  16. Superscalar Execution
      [Figure: the same issue-slot grid, with unfilled slots within a cycle labeled horizontal waste and entirely empty cycles labeled vertical waste.]

  17. Superscalar Execution
      [Figure: issue slots over time in processor cycles.]

  18. Superscalar Execution with Fine-Grain Multithreading
      • The processor has multiple thread contexts
        • PC
        • Register set
        • Memory space
      • Context switch time is one cycle (~300 ps)
      • Instructions flow through the pipeline together (a toy sketch of per-cycle context selection follows this slide)
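  A toy software model of what a "one-cycle context switch" means here, assuming simple round-robin selection among hardware contexts (the selection policy is an assumption, not stated on the slide): the "switch" is just choosing which context's PC feeds fetch, so it costs nothing beyond a mux.

      #define NCTX 4

      /* Hypothetical hardware thread contexts. */
      struct hw_ctx { unsigned long pc; unsigned long regs[32]; };
      static struct hw_ctx ctx[NCTX];

      /* Each cycle, fetch from a different context: round-robin selection. */
      static unsigned long fetch_pc_for_cycle(unsigned long cycle) {
          return ctx[cycle % NCTX].pc;   /* the one-cycle "context switch" */
      }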

  19. Superscalar Execution with Fine-Grain Multithreading
      [Figure: the issue-slot grid with each cycle issuing from a single thread, rotating among Threads 1-3; only horizontal waste remains.]

  20. Simultaneous Multithreading
      • The same multiple contexts
      • Fetch from multiple threads per cycle
      • Instructions all flow through the pipeline together.
      [Figure: the issue-slot grid with instructions from Threads 1-5 sharing issue slots within the same cycle.]

  21. The Potential for SMT

  22. Goals
      • SMT goals:
        1. Minimize the architectural impact on a conventional superscalar design.
        2. Minimize the performance impact on a single thread.
        3. Achieve significant throughput gains with many threads.

  23. A Conventional Superscalar Architecture
      [Figure: a pipeline in which the PC and instruction cache feed a fetch unit, followed by decode and register renaming, then separate floating-point and integer instruction queues, fp and integer register files, fp units, integer units, and int/ld-store units, backed by a data cache.]
      • Fetch up to 8 instructions per cycle
      • Out-of-order, speculative execution
      • Issue 3 floating point and 6 integer instructions per cycle

  24. An SMT Architecture
      [Figure: the same pipeline as the previous slide; the fetch unit is now fed by a PC per thread context.]
      • Fetch up to 8 instructions per cycle
      • Out-of-order, speculative execution
      • Issue 3 floating point and 6 integer instructions per cycle

  25. Performance of the Naïve Design
      [Figure: throughput in instructions per cycle (1-5) vs. number of threads (2-8) for the unmodified superscalar design.]

  26. Bottlenecks of the Baseline Architecture
      • Instruction queue full conditions (12-21% of cycles)
        • Lack of parallelism in the queue.
      • Fetch throughput (4.2 instructions per cycle when the queue is not full)

  27. Improving Fetch Throughput
      • The fetch unit in an SMT architecture has two distinct advantages over a conventional architecture:
        • It can fetch from multiple threads at once.
        • It can choose which threads to fetch.

  28. Improved Fetch Performance
      • Fetching from 2 threads/cycle achieves most of the benefit of multiple-thread fetch.
      • Fetching from the thread(s) with the fewest unissued instructions in flight significantly increases parallelism and throughput (the ICOUNT fetch policy; see the sketch after this slide).
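  A sketch of the ICOUNT selection step, assuming a per-thread counter of unissued in-flight instructions (the names are illustrative): each cycle, prefer the thread whose counter is lowest, since its instructions are moving through the queues fastest.

      #define NTHREADS 8

      /* Unissued instructions currently in flight for each thread. */
      static int icount[NTHREADS];

      /* Pick the thread to fetch from this cycle (ICOUNT policy). */
      static int pick_fetch_thread(void) {
          int best = 0;
          for (int t = 1; t < NTHREADS; t++)
              if (icount[t] < icount[best])
                  best = t;
          return best;  /* icount[best] rises on fetch, falls as instructions issue */
      }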

  29. Improved Performance
      [Figure: instructions per cycle (1-5) vs. number of threads (2-8) for the improved design, the baseline, and the unmodified superscalar.]

  30. The Tera MTA
      • 256 processors
        • 128 threads each
        • 1 instruction/thread in the pipe at a time
        • 21-stage pipeline
      • Key design points
        • No caches
        • Randomized memory space
        • Full/empty bits on each word of memory
      • First deployed at SDSC in 1997.

  31. Full/Empty Bits
      • A lightweight synchronization mechanism
        • The FE bit is 0 until a write occurs
        • Writes set it to 1.
        • Loads block until the FE bit is 1.
      (A software emulation sketch follows this slide.)
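  The MTA implements this in hardware, but the semantics can be emulated in software. A hypothetical sketch using a mutex and condition variable (an emulation for illustration, not the MTA's mechanism):

      #include <pthread.h>

      typedef struct {
          pthread_mutex_t lock;
          pthread_cond_t  filled;
          int             fe;     /* the FE bit: 0 = empty, 1 = full */
          long            value;
      } fe_word;

      static fe_word w = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, 0 };

      void fe_write(fe_word *p, long v) {
          pthread_mutex_lock(&p->lock);
          p->value = v;
          p->fe = 1;                         /* a write sets the FE bit */
          pthread_cond_broadcast(&p->filled);
          pthread_mutex_unlock(&p->lock);
      }

      long fe_read(fe_word *p) {
          pthread_mutex_lock(&p->lock);
          while (!p->fe)                     /* loads block until the FE bit is 1 */
              pthread_cond_wait(&p->filled, &p->lock);
          long v = p->value;
          pthread_mutex_unlock(&p->lock);
          return v;
      }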

  32. MTA's Goals
      • The MTA was after massive parallelism
        • Low synchronization costs
        • Lots of threads
        • Fast context switches

  33. MTA's Problems
      • Many apps don't have 1000 threads
        • Those that do don't have them all the time
      • In sequential code, performance was equivalent to a 1 MHz machine with no caches.
      • What does Amdahl's law tell us about sequential code?
        • S = 2000 (for 2000 threads)
        • x = 0.99 (99% parallelizable)
        • S_total ≈ 95x speedup compared to a single MTA thread (worked out after this slide)
        • A single MTA thread is slow!
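  Working the slide's numbers through Amdahl's law shows where the 95x comes from:

      S_total = 1 / ((1 - x) + x/S)
              = 1 / ((1 - 0.99) + 0.99/2000)
              = 1 / (0.01 + 0.000495)
              ≈ 95.3

  So even with 2000 threads, the 1% of the work that is sequential caps the speedup near 95x, and that sequential 1% runs on a single slow MTA thread.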

  34. MTA's Problems and Tera's Recent History
      • They tried to innovate in too many ways at once
        • New architecture
        • New programming paradigm
        • Exotic gallium arsenide semiconductor technology
          • Faster than CMOS but much less mature
      • Tera bought Cray and took their name
        • Builds more conventional supercomputers
        • High-performance interconnects
