SLIDE 1

Multithreading

SLIDE 2

Architectural State and Context Switches

SLIDE 3

Architectural State

  • The “Architectural State” of a thread is everything that defines the state of a running program:
  • The contents of the register file
  • The current program counter
  • The current contents of memory (its “address space”)
  • Note that all of these are well-defined because the semantics of the ISA dictate that instructions execute one at a time.
  • The architectural state of a processor includes this and other privileged state.
  • A thread is running on a processor if the processor’s register file, PC, and current contents of memory are those of the thread.
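As a concrete illustration, the per-thread portion of this state could be captured in a small C struct; this is a hypothetical sketch (the field names and the 32-register machine are assumptions, not any particular OS or ISA):

    /* Hypothetical per-thread architectural state; names are illustrative. */
    #include <stdint.h>

    struct arch_state {
        uint64_t regs[32];    /* contents of the register file */
        uint64_t pc;          /* current program counter */
        uint64_t page_table;  /* root of the address space (the memory contents) */
    };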

SLIDE 4

Context Switches

  • Switch out
  • Save the current register state
  • Save the current PC
  • Switch in
  • Install new register state
  • Install new address space (sometimes)
  • Jump to new PC
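A minimal sketch of this sequence, reusing the hypothetical arch_state above; save_registers(), restore_registers(), current_pc(), load_page_table(), and jump_to() are assumed helpers that a real kernel would implement in assembly:

    /* Sketch only: a real context switch lives in assembly; all the helper
       routines here are assumptions, not a real kernel API. */
    void context_switch(struct arch_state *old, struct arch_state *new) {
        save_registers(old->regs);            /* switch out: save register state */
        old->pc = current_pc();               /* switch out: save the current PC */
        restore_registers(new->regs);         /* switch in: install new registers */
        if (new->page_table != old->page_table)
            load_page_table(new->page_table); /* switch in: new address space (sometimes) */
        jump_to(new->pc);                     /* switch in: jump to the new PC */
    }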

SLIDE 5

Kinds of Context Switches

  • Userspace context switch
  • This is really just another form of control transfer
  • Registers and PC only
  • Userspace threads share an address space, so there is no need to change it
  • Example: setjmp()/longjmp() (sketched below), userspace threading packages
  • ~1us
  • Thread switch
  • Happens in the kernel – each thread is a “lightweight process”
  • Registers and PC only
  • Threads in the same process share an address space, so there is no need to change it
  • ~3us
  • Process switch
  • Happens in the kernel
  • Registers, PC, and address space
  • A bunch of other stuff too – file descriptors, etc.
  • ~3us
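The setjmp()/longjmp() pair makes the “registers and PC only” point concrete: setjmp() snapshots the callee-saved registers and return address, and longjmp() reinstalls them. A runnable example:

    #include <setjmp.h>
    #include <stdio.h>

    static jmp_buf ctx;

    int main(void) {
        if (setjmp(ctx) == 0) {        /* save registers and PC into ctx */
            printf("before the switch\n");
            longjmp(ctx, 1);           /* restore them: control re-enters setjmp */
        }
        printf("after the switch\n");  /* reached only via longjmp */
        return 0;
    }

Userspace threading packages build full context switches out of the same idea, with one jmp_buf-like save area per thread.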

SLIDE 6

Hidden Costs of Context Switches

  • Latency for saving and restoring the registers, PC, and memory state is unavoidable and relatively easy to measure (a sketch follows below)
  • But the cost of “warming” the caches and predictors is non-trivial


[Brown ’10] [Choi ’08]
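A rough way to measure the direct cost, as a sketch: ping-pong one byte between two processes over a pair of pipes, so that (on a single core) every round trip forces two context switches. This is a common benchmarking idiom, not code from the slides:

    /* Rough benchmark sketch (POSIX): estimates direct context-switch cost. */
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    #define ITERS 100000

    int main(void) {
        int p2c[2], c2p[2];
        char b = 0;
        pipe(p2c);
        pipe(c2p);
        if (fork() == 0) {                   /* child: echo each byte back */
            for (int i = 0; i < ITERS; i++) {
                read(p2c[0], &b, 1);
                write(c2p[1], &b, 1);
            }
            _exit(0);
        }
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < ITERS; i++) {    /* each round trip = 2 switches */
            write(p2c[1], &b, 1);
            read(c2p[0], &b, 1);
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("~%.0f ns per switch\n", ns / (2.0 * ITERS));
        return 0;
    }

Note what this misses: the benchmark’s working set is tiny, so it never pays the cache- and predictor-warming cost the second bullet is about.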

SLIDE 7

Parallelism’s Granularity and Abundance

SLIDE 8

Communication Between Threads is Expensive

  • Context switches are expensive
  • (see previous slides)
  • Coherence is expensive
  • On the order of accessing main memory
  • Communication is required to exploit parallelism
  • Each “fork” and “join” point in the dependence graph requires communication.

SLIDE 9

Parallelism Granularity (ILP)

Finest (1 inst) to Fine (several inst):

for (i = 1; i <= 5; i++) {
    s[i] = a[i] + b[i] + c[i] + …;
}

SLIDE 10

Parallelism Granularity (TLP)


Coarse (1000s of inst) to Coarsest (millions of inst):

Par_msum(Matrix A, Matrix B) {
    Matrix R;
    R[upleft]   = Spawn(msum(A[upleft],   B[upleft]));
    R[upright]  = Spawn(msum(A[upright],  B[upright]));
    R[lowleft]  = Spawn(msum(A[lowleft],  B[lowleft]));
    R[lowright] = Spawn(msum(A[lowright], B[lowright]));
    Wait(the_barrier);
}
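The Spawn/Wait pseudocode above maps directly onto Pthreads. A minimal sketch of the same coarse-grain pattern, with row bands standing in for the slide’s quadrants (sizes and names are made up for illustration):

    #include <pthread.h>

    #define N 1024
    static double A[N][N], B[N][N], R[N][N];

    struct band { int r0, r1; };             /* a band of rows to sum */

    static void *msum(void *arg) {           /* one "Spawn"ed worker */
        struct band *q = arg;
        for (int i = q->r0; i < q->r1; i++)
            for (int j = 0; j < N; j++)
                R[i][j] = A[i][j] + B[i][j];
        return NULL;
    }

    void par_msum(void) {
        pthread_t t[4];
        struct band q[4];
        for (int k = 0; k < 4; k++) {        /* Spawn: one thread per band */
            q[k].r0 = k * (N / 4);
            q[k].r1 = (k + 1) * (N / 4);
            pthread_create(&t[k], NULL, msum, &q[k]);
        }
        for (int k = 0; k < 4; k++)          /* Wait: join is the barrier */
            pthread_join(t[k], NULL);
    }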

SLIDE 11

Parallelism Granularity

  • All kinds of parallelism are (in general) hard for people and computers to find.
  • We would like to exploit parallelism wherever it is.
  • Fine-grain: Modern processors are pretty good at ILP
  • An infinitely large instruction window could exploit all available parallelism.
  • Instruction window size limits the scope over which they can find ILP
  • Coarse-grain: Pthreads, fork(), and OpenMP are OK options (a minimal OpenMP sketch follows below).
  • Context switch costs place a lower bound on the grain size that is profitable to exploit
  • So, we would like to lower the cost of synchronization/communication.
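As referenced above, a minimal OpenMP sketch of coarse-grain loop parallelism (compile with -fopenmp; the function and parameter names are illustrative):

    /* OpenMP splits the iterations across threads; the per-thread grain must
       outweigh thread startup/switch costs for this to be profitable. */
    void vsum(double *s, const double *a, const double *b,
              const double *c, int n) {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            s[i] = a[i] + b[i] + c[i];
    }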

SLIDE 12

Apps Vs. Cores

  • App variation
  • Some apps have lots of ILP (floating point)
  • Some apps have very little ILP (gcc), as does any app with very poor cache performance
  • An “average” thread has an IPC of about 0.5-2
  • Some apps have lots of threads (Apache)
  • Some have just one (most apps)
  • Core variation
  • One big, expensive, power-hungry core is good at ILP
  • Many small cores are better at TLP.
  • How can we have the best of both worlds?
  • Use big(ger) cores, convert TLP into ILP

SLIDE 13

Motivation

[Figure: percent of total issue cycles, broken down into processor busy vs. cycles lost to itlb misses, dtlb misses, icache misses, dcache misses, branch mispredictions, control hazards, load delays, short/long integer latency, short/long floating-point latency, and memory conflicts.]

SLIDE 14

Hardware Multithreading

[Figure: a conventional processor has a single PC and register file feeding the CPU’s instruction stream; a multithreaded processor provides several PC/register-file contexts, each with its own instruction stream, sharing one CPU.]

SLIDE 15

Superscalar Execution

[Figure: issue slots vs. time (processor cycles) on a superscalar machine.]

SLIDE 16

Superscalar Execution

[Figure: the same issue-slot diagram, labeling horizontal waste (unused slots within an active cycle) and vertical waste (entirely idle cycles).]

SLIDE 17

Superscalar Execution

[Figure: issue slots vs. time (processor cycles).]

SLIDE 18

Superscalar Execution with Fine-Grain Multithreading

  • The processor has multiple thread contexts
  • PC
  • Register set
  • Memory space
  • Context switch time is one cycle (~300ps)
  • Instructions flow through the pipeline together

SLIDE 19

Superscalar Execution with Fine-Grain Multithreading

[Figure: issue slots vs. time (processor cycles) with fine-grain multithreading; threads 1-3 take turns issuing, removing vertical waste but leaving horizontal waste.]

SLIDE 20

Simultaneous Multithreading

[Figure: issue slots vs. time (processor cycles) with SMT; instructions from threads 1-5 share issue slots within the same cycle.]

  • The same multiple contexts as fine-grain multithreading
  • Fetch from multiple threads per cycle
  • Instructions all flow through the pipeline together.

SLIDE 21

The Potential for SMT

SLIDE 22

Goals

  • SMT Goals
  • 1. Minimize the architectural impact on conventional superscalar design.
  • 2. Minimize the performance impact on a single thread.
  • 3. Achieve significant throughput gains with many threads.

SLIDE 23

A Conventional Superscalar Architecture

[Figure: a conventional superscalar pipeline: one PC drives the fetch unit, which fetches 8 instructions per cycle from the instruction cache; decode and register renaming feed separate floating-point and integer instruction queues, which issue to fp units and int/ld-store units backed by the fp and integer register files and the data cache.]

  • Fetch up to 8 instructions per cycle
  • Issue 3 floating point, 6 integer instructions per cycle
  • Out-of-order, speculative execution

SLIDE 24

An SMT Architecture

[Figure: the SMT version of the same pipeline; the visible change is multiple PCs, one per thread, feeding the fetch unit.]

  • Fetch up to 8 instructions per cycle
  • Issue 3 floating point, 6 integer instructions per cycle
  • Out-of-order, speculative execution

SLIDE 25

Performance of the Naïve Design

[Figure: throughput (instructions per cycle, 1-5) vs. number of threads (2-8) for the naïve design, compared against the unmodified superscalar.]

SLIDE 26

Bottlenecks of the Baseline Architecture

  • Instruction queue full conditions (12-21% of cycles)
  • Lack of parallelism in the queue.
  • Fetch throughput (4.2 instructions per cycle when the queue is not full)

SLIDE 27

Improving Fetch Throughput

  • The fetch unit in an SMT architecture has two distinct advantages over a conventional architecture:
  • It can fetch from multiple threads at once.
  • It can choose which threads to fetch.
SLIDE 28

Improved Fetch Performance

  • Fetching from 2 threads/cycle achieved most of the performance from multiple-thread fetch.
  • Fetching from the thread(s) with the fewest unissued instructions in flight significantly increases parallelism and throughput (the ICOUNT fetch policy).
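A sketch of what ICOUNT selection could look like inside a simulator; the per-thread counters, thread count, and two-thread fetch limit are assumptions drawn from the description above, not the SMT papers’ actual code:

    /* Hypothetical simulator fragment: ICOUNT picks the threads with the
       fewest unissued instructions in flight, here fetching from up to two. */
    #define NTHREADS 8

    int icount[NTHREADS];  /* unissued, in-flight instruction count per thread */

    void pick_fetch_threads(int chosen[2]) {
        chosen[0] = chosen[1] = -1;
        for (int t = 0; t < NTHREADS; t++) {
            if (chosen[0] < 0 || icount[t] < icount[chosen[0]]) {
                chosen[1] = chosen[0];  /* old best becomes runner-up */
                chosen[0] = t;          /* new minimum */
            } else if (chosen[1] < 0 || icount[t] < icount[chosen[1]]) {
                chosen[1] = t;
            }
        }
    }

The intuition: a thread with few instructions in flight is moving through the machine quickly and is least likely to clog the issue queue.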

SLIDE 29

Improved Performance

[Figure: instructions per cycle (1-5) vs. number of threads (2-8), comparing the improved design, the baseline, and the unmodified superscalar.]

SLIDE 30

The Tera MTA

  • 256 processors
  • 128 threads each
  • 1 inst/thread in the pipe at a time
  • 21-stage pipeline
  • Key design points
  • No caches
  • Randomized memory space
  • Full-empty bits on each word of memory
  • First deployed at SDSC in 1997.

SLIDE 31

Full/Empty Bits

  • Lightweight synchronization mechanism
  • The FE bit is 0 until a write occurs
  • Writes set it to 1.
  • Loads block until the FE bit is 1.
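The MTA does this in hardware on every memory word. As a software analogy only, the behavior can be sketched in C11 with an atomic flag per word (spinning stands in for the MTA’s hardware blocking):

    /* Software analogy of a full/empty bit; the MTA implements this in
       hardware, with blocking instead of spinning. */
    #include <stdatomic.h>
    #include <stdint.h>

    struct fe_word {
        atomic_bool full;   /* the FE bit: false = empty, true = full */
        uint64_t    value;
    };

    void fe_write(struct fe_word *w, uint64_t v) {
        w->value = v;       /* store the data first... */
        atomic_store_explicit(&w->full, true,
                              memory_order_release);  /* ...then set FE to 1 */
    }

    uint64_t fe_read(struct fe_word *w) {
        while (!atomic_load_explicit(&w->full, memory_order_acquire))
            ;               /* "block" until the FE bit is 1 */
        return w->value;
    }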

SLIDE 32

MTA’s Goals

  • The MTA was after massive parallelism
  • Low synchronization costs
  • Lots of threads
  • Fast context switches

SLIDE 33

MTA’s Problems

  • Many apps don’t have 1000 threads
  • Those that do don’t have them all the time
  • In sequential code, performance was equivalent to a 1 MHz machine with no caches.
  • What does Amdahl’s law tell us about sequential code?
  • S = 2000 (for 2000 threads)
  • x = 0.99 (99% parallelizable)
  • Stot ≈ 95x speedup compared to a single MTA thread
  • A single MTA thread is slow!
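Plugging the slide’s numbers into Amdahl’s law:

\[
S_{tot} = \frac{1}{(1 - x) + \frac{x}{S}} = \frac{1}{0.01 + \frac{0.99}{2000}} \approx 95
\]

So even with 2000 threads, the 1% of sequential code caps the speedup near 95x over one (slow) MTA thread.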

SLIDE 34

MTA’s Problems and Tera’s Recent History

  • They tried to innovate in too many ways at once
  • New architecture
  • New programming paradigm
  • Exotic gallium arsenide semiconductor technology
  • Faster than CMOS but much less mature
  • Tera bought Cray and took their name
  • Builds more conventional supercomputers
  • High-performance interconnects
