SLIDE 1

Multithreading

SLIDE 2

Architectural State and Context Switches

SLIDE 3

Architectural State

  • The “Architectural State” of a thread is everything that defines the state of a running program:
  • The contents of the register file
  • The current program counter
  • The current contents of memory (its “address space”)
  • Note that all of these are well-defined because the semantics of the ISA dictate that instructions execute one at a time.
  • The architectural state of a processor includes this and other privileged state.
  • A thread is running on a processor if the processor’s register file, PC, and current contents of memory are those of the thread.
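As a concrete illustration, the per-thread portion of this state could be captured in a small C struct; this is a hypothetical sketch (the field names and the 32-register machine are assumptions, not any particular OS or ISA):

    /* Hypothetical per-thread architectural state; names are illustrative. */
    #include <stdint.h>

    struct arch_state {
        uint64_t regs[32];    /* contents of the register file */
        uint64_t pc;          /* current program counter */
        uint64_t page_table;  /* root of the address space (the memory contents) */
    };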

SLIDE 4

Context Switches

  • Switch out
  • Save the current register state
  • Save the current PC
  • Switch in
  • Install new register state
  • Install new address space (sometimes)
  • Jump to new PC
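A minimal sketch of this sequence, reusing the hypothetical arch_state above; save_registers(), restore_registers(), current_pc(), load_page_table(), and jump_to() are assumed helpers that a real kernel would implement in assembly:

    /* Sketch only: a real context switch lives in assembly; all the helper
       routines here are assumptions, not a real kernel API. */
    void context_switch(struct arch_state *old, struct arch_state *new) {
        save_registers(old->regs);            /* switch out: save register state */
        old->pc = current_pc();               /* switch out: save the current PC */
        restore_registers(new->regs);         /* switch in: install new registers */
        if (new->page_table != old->page_table)
            load_page_table(new->page_table); /* switch in: new address space (sometimes) */
        jump_to(new->pc);                     /* switch in: jump to the new PC */
    }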

SLIDE 5

Kinds of Context Switches

  • Userspace context switch
  • This is really just another form of control transfer
  • Registers and PC only
  • Userspace threads share an address space, so there is no need to change it
  • Example: setjmp()/longjmp() (sketched below), userspace threading packages
  • ~1us
  • Thread switch
  • Happens in the kernel – each thread is a “lightweight process”
  • Registers and PC only
  • Threads in the same process share an address space, so there is no need to change it
  • ~3us
  • Process switch
  • Happens in the kernel
  • Registers, PC, and address space
  • A bunch of other stuff too – file descriptors, etc.
  • ~3us
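The setjmp()/longjmp() pair makes the “registers and PC only” point concrete: setjmp() snapshots the callee-saved registers and return address, and longjmp() reinstalls them. A runnable example:

    #include <setjmp.h>
    #include <stdio.h>

    static jmp_buf ctx;

    int main(void) {
        if (setjmp(ctx) == 0) {        /* save registers and PC into ctx */
            printf("before the switch\n");
            longjmp(ctx, 1);           /* restore them: control re-enters setjmp */
        }
        printf("after the switch\n");  /* reached only via longjmp */
        return 0;
    }

Userspace threading packages build full context switches out of the same idea, with one jmp_buf-like save area per thread.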

SLIDE 6

Hidden Costs of Context Switches

  • Latency for saving and restoring the registers, PC, and memory state is unavoidable and relatively easy to measure (a sketch follows below)
  • But the cost of “warming” the caches and predictors is non-trivial


[Brown ’10] [Choi ’08]
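A rough way to measure the direct cost, as a sketch: ping-pong one byte between two processes over a pair of pipes, so that (on a single core) every round trip forces two context switches. This is a common benchmarking idiom, not code from the slides:

    /* Rough benchmark sketch (POSIX): estimates direct context-switch cost. */
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    #define ITERS 100000

    int main(void) {
        int p2c[2], c2p[2];
        char b = 0;
        pipe(p2c);
        pipe(c2p);
        if (fork() == 0) {                   /* child: echo each byte back */
            for (int i = 0; i < ITERS; i++) {
                read(p2c[0], &b, 1);
                write(c2p[1], &b, 1);
            }
            _exit(0);
        }
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < ITERS; i++) {    /* each round trip = 2 switches */
            write(p2c[1], &b, 1);
            read(c2p[0], &b, 1);
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("~%.0f ns per switch\n", ns / (2.0 * ITERS));
        return 0;
    }

Note what this misses: the benchmark’s working set is tiny, so it never pays the cache- and predictor-warming cost the second bullet is about.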

SLIDE 7

Parallelism’s Granularity and Abundance

SLIDE 8

Communication Between Threads is Expensive

  • Context switches are expensive
  • (see previous slides)
  • Coherence is expensive
  • On the order of accessing main memory
  • Communication is required to exploit parallelism
  • Each “fork” and “join” point in the dependence graph requires communication.

SLIDE 9

Parallelism Granularity (ILP)

Finest (1 inst) to Fine (several inst):

for (i = 1; i <= 5; i++) {
    s[i] = a[i] + b[i] + c[i] + …;
}

SLIDE 10

Parallelism Granularity (TLP)


Coarse (1000s of inst) to Coarsest (millions of inst):

Par_msum(Matrix A, Matrix B) {
    Matrix R;
    R[upleft]   = Spawn(msum(A[upleft],   B[upleft]));
    R[upright]  = Spawn(msum(A[upright],  B[upright]));
    R[lowleft]  = Spawn(msum(A[lowleft],  B[lowleft]));
    R[lowright] = Spawn(msum(A[lowright], B[lowright]));
    Wait(the_barrier);
}
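The Spawn/Wait pseudocode above maps directly onto Pthreads. A minimal sketch of the same coarse-grain pattern, with row bands standing in for the slide’s quadrants (sizes and names are made up for illustration):

    #include <pthread.h>

    #define N 1024
    static double A[N][N], B[N][N], R[N][N];

    struct band { int r0, r1; };             /* a band of rows to sum */

    static void *msum(void *arg) {           /* one "Spawn"ed worker */
        struct band *q = arg;
        for (int i = q->r0; i < q->r1; i++)
            for (int j = 0; j < N; j++)
                R[i][j] = A[i][j] + B[i][j];
        return NULL;
    }

    void par_msum(void) {
        pthread_t t[4];
        struct band q[4];
        for (int k = 0; k < 4; k++) {        /* Spawn: one thread per band */
            q[k].r0 = k * (N / 4);
            q[k].r1 = (k + 1) * (N / 4);
            pthread_create(&t[k], NULL, msum, &q[k]);
        }
        for (int k = 0; k < 4; k++)          /* Wait: join is the barrier */
            pthread_join(t[k], NULL);
    }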

SLIDE 11

Parallelism Granularity

  • All kinds of parallelism are (in general) hard for people and computers to find.
  • We would like to exploit parallelism wherever it is.
  • Fine-grain: Modern processors are pretty good at ILP
  • An infinitely large instruction window could exploit all available parallelism.
  • Instruction window size limits the scope over which they can find ILP
  • Coarse-grain: Pthreads, fork(), and OpenMP are OK options (a minimal OpenMP sketch follows below).
  • Context switch costs place a lower bound on the grain size that is profitable to exploit
  • So, we would like to lower the cost of synchronization/communication.
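As referenced above, a minimal OpenMP sketch of coarse-grain loop parallelism (compile with -fopenmp; the function and parameter names are illustrative):

    /* OpenMP splits the iterations across threads; the per-thread grain must
       outweigh thread startup/switch costs for this to be profitable. */
    void vsum(double *s, const double *a, const double *b,
              const double *c, int n) {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            s[i] = a[i] + b[i] + c[i];
    }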

SLIDE 12

Apps Vs. Cores

  • App variation
  • Some apps have lots of ILP (floating point)
  • Some apps have very little ILP (gcc), as does any app with very poor cache performance
  • An “average” thread has an IPC of about 0.5-2
  • Some apps have lots of threads (Apache)
  • Some have just one (most apps)
  • Core variation
  • One big, expensive, power-hungry core is good at ILP
  • Many small cores are better at TLP.
  • How can we have the best of both worlds?
  • Use big(ger) cores, convert TLP into ILP

SLIDE 13

Motivation

[Figure: percent of total issue cycles, broken down into processor busy vs. cycles lost to itlb misses, dtlb misses, icache misses, dcache misses, branch mispredictions, control hazards, load delays, short/long integer latency, short/long floating-point latency, and memory conflicts.]

SLIDE 14

Hardware Multithreading

[Figure: a conventional processor has a single PC and register file feeding the CPU’s instruction stream; a multithreaded processor provides several PC/register-file contexts, each with its own instruction stream, sharing one CPU.]

SLIDE 15

Superscalar Execution

[Figure: issue slots vs. time (processor cycles) on a superscalar machine.]

SLIDE 16

Superscalar Execution

[Figure: the same issue-slot diagram, labeling horizontal waste (unused slots within an active cycle) and vertical waste (entirely idle cycles).]

SLIDE 17

Superscalar Execution

[Figure: issue slots vs. time (processor cycles).]

SLIDE 18

Superscalar Execution with Fine-Grain Multithreading

  • The processor has multiple thread contexts
  • PC
  • Register set
  • Memory space
  • Context switch time is one cycle (~300ps)
  • Instructions flow through the pipeline together

SLIDE 19

Superscalar Execution with Fine-Grain Multithreading

[Figure: issue slots vs. time (processor cycles) with fine-grain multithreading; threads 1-3 take turns issuing, removing vertical waste but leaving horizontal waste.]

SLIDE 20

Simultaneous Multithreading

[Figure: issue slots vs. time (processor cycles) with SMT; instructions from threads 1-5 share issue slots within the same cycle.]

  • The same multiple contexts as fine-grain multithreading
  • Fetch from multiple threads per cycle
  • Instructions all flow through the pipeline together.

SLIDE 21

The Potential for SMT

SLIDE 22

Goals

  • SMT Goals
  • 1. Minimize the architectural impact on conventional superscalar design.
  • 2. Minimize the performance impact on a single thread.
  • 3. Achieve significant throughput gains with many threads.

SLIDE 23

A Conventional Superscalar Architecture

[Figure: a conventional superscalar pipeline: one PC drives the fetch unit, which fetches 8 instructions per cycle from the instruction cache; decode and register renaming feed separate floating-point and integer instruction queues, which issue to fp units and int/ld-store units backed by the fp and integer register files and the data cache.]

  • Fetch up to 8 instructions per cycle
  • Issue 3 floating point, 6 integer instructions per cycle
  • Out-of-order, speculative execution

SLIDE 24

An SMT Architecture

[Figure: the SMT version of the same pipeline; the visible change is multiple PCs, one per thread, feeding the fetch unit.]

  • Fetch up to 8 instructions per cycle
  • Issue 3 floating point, 6 integer instructions per cycle
  • Out-of-order, speculative execution

SLIDE 25

Performance of the Naïve Design

[Figure: throughput (instructions per cycle, 1-5) vs. number of threads (2-8) for the naïve design, compared against the unmodified superscalar.]

SLIDE 26

Bottlenecks of the Baseline Architecture

  • Instruction queue full conditions (12-21% of cycles)
  • Lack of parallelism in the queue.
  • Fetch throughput (4.2 instructions per cycle when the queue is not full)

SLIDE 27

Improving Fetch Throughput

  • The fetch unit in an SMT architecture has two distinct advantages over a conventional architecture:
  • It can fetch from multiple threads at once.
  • It can choose which threads to fetch.
SLIDE 28

Improved Fetch Performance

  • Fetching from 2 threads/cycle achieved most of the performance from multiple-thread fetch.
  • Fetching from the thread(s) with the fewest unissued instructions in flight significantly increases parallelism and throughput (the ICOUNT fetch policy).
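A sketch of what ICOUNT selection could look like inside a simulator; the per-thread counters, thread count, and two-thread fetch limit are assumptions drawn from the description above, not the SMT papers’ actual code:

    /* Hypothetical simulator fragment: ICOUNT picks the threads with the
       fewest unissued instructions in flight, here fetching from up to two. */
    #define NTHREADS 8

    int icount[NTHREADS];  /* unissued, in-flight instruction count per thread */

    void pick_fetch_threads(int chosen[2]) {
        chosen[0] = chosen[1] = -1;
        for (int t = 0; t < NTHREADS; t++) {
            if (chosen[0] < 0 || icount[t] < icount[chosen[0]]) {
                chosen[1] = chosen[0];  /* old best becomes runner-up */
                chosen[0] = t;          /* new minimum */
            } else if (chosen[1] < 0 || icount[t] < icount[chosen[1]]) {
                chosen[1] = t;
            }
        }
    }

The intuition: a thread with few instructions in flight is moving through the machine quickly and is least likely to clog the issue queue.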

SLIDE 29

Improved Performance

[Figure: instructions per cycle (1-5) vs. number of threads (2-8), comparing the improved design, the baseline, and the unmodified superscalar.]

SLIDE 30

The Tera MTA

  • 256 processors
  • 128 threads each
  • 1 inst/thread in the pipe at a time
  • 21-stage pipeline
  • Key design points
  • No caches
  • Randomized memory space
  • Full-empty bits on each word of memory
  • First deployed at SDSC in 1997.

SLIDE 31

Full/Empty Bits

  • Lightweight synchronization mechanism
  • The FE bit is 0 until a write occurs
  • Writes set it to 1.
  • Loads block until the FE bit is 1.
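The MTA does this in hardware on every memory word. As a software analogy only, the behavior can be sketched in C11 with an atomic flag per word (spinning stands in for the MTA’s hardware blocking):

    /* Software analogy of a full/empty bit; the MTA implements this in
       hardware, with blocking instead of spinning. */
    #include <stdatomic.h>
    #include <stdint.h>

    struct fe_word {
        atomic_bool full;   /* the FE bit: false = empty, true = full */
        uint64_t    value;
    };

    void fe_write(struct fe_word *w, uint64_t v) {
        w->value = v;       /* store the data first... */
        atomic_store_explicit(&w->full, true,
                              memory_order_release);  /* ...then set FE to 1 */
    }

    uint64_t fe_read(struct fe_word *w) {
        while (!atomic_load_explicit(&w->full, memory_order_acquire))
            ;               /* "block" until the FE bit is 1 */
        return w->value;
    }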

SLIDE 32

MTA’s Goals

  • The MTA was after massive parallelism
  • Low synchronization costs
  • Lots of threads
  • Fast context switches

SLIDE 33

MTA’s Problems

  • Many apps don’t have 1000 threads
  • Those that do don’t have them all the time
  • In sequential code, performance was equivalent to a 1 MHz machine with no caches.
  • What does Amdahl’s law tell us about sequential code?
  • S = 2000 (for 2000 threads)
  • x = 0.99 (99% parallelizable)
  • Stot ≈ 95x speedup compared to a single MTA thread
  • A single MTA thread is slow!
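Plugging the slide’s numbers into Amdahl’s law:

\[
S_{tot} = \frac{1}{(1 - x) + \frac{x}{S}} = \frac{1}{0.01 + \frac{0.99}{2000}} \approx 95
\]

So even with 2000 threads, the 1% of sequential code caps the speedup near 95x over one (slow) MTA thread.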

SLIDE 34

MTA’s Problems and Tera’s Recent History

  • They tried to innovate in too many ways at once
  • New architecture
  • New programming paradigm
  • Exotic gallium arsenide semiconductor technology
  • Faster than CMOS but much less mature
  • Tera bought Cray and took their name
  • Builds more conventional supercomputers
  • High-performance interconnects
