SLIDE 1

High-Performance Execution of Multithreaded Workloads on CMPs

  • M. Aater Suleman

Advisor: Yale Patt

HPS Research Group
The University of Texas at Austin

SLIDE 2

How do we use the transistors?

  • More transistors → Higher-performance core

– Performance increases without programmer effort
– Larger cores are complex and consume more power

  • More transistors → Bigger cache

– Assists the core by reducing memory accesses
– Easier to design and consumes less power
– Pentium M: 50M of its 77M transistors were cache

  • More transistors → More cores

– Chip Multiprocessors (CMPs)
– Less complex cores
– Run at lower frequency (Power ∝ frequency³)

But, do CMPs improve performance?

SLIDE 3

Multithreading

To leverage CMPs, applications must be split into threads

[Diagram: a single-threaded program split into multiple threads]

But, can we do this for all applications?

SLIDE 4

Easy-to-parallelize Kernels

Kernel from ImageMagick:

    GrayscaleToMonochrome(picture)
        foreach (OldPixel in picture)
            if (OldPixel > Threshold)
                NewPixel = 1
            else
                NewPixel = 0
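Because every output pixel depends only on the corresponding input pixel, the iterations are independent and can run on any core in any order. A minimal C sketch with OpenMP (illustrative names; a simplified stand-in for the slide's pseudocode, not ImageMagick source):

    #include <omp.h>

    /* Each iteration touches only its own pixel, so the loop parallelizes trivially. */
    void grayscale_to_monochrome(const unsigned char *old_pixel,
                                 unsigned char *new_pixel,
                                 int num_pixels, unsigned char threshold)
    {
        #pragma omp parallel for          /* split iterations across the cores */
        for (int i = 0; i < num_pixels; i++)
            new_pixel[i] = (old_pixel[i] > threshold) ? 1 : 0;
    }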

SLIDE 5

Serial Kernels

Kernel from ImageMagick:

    Smooth(Picture)
        for i = 1 to N
            Pixel[i] = (Pixel[i-1] + Pixel[i]) / 2

[Figure: old pixels vs. new pixels 1-4; each new pixel averages the previously smoothed pixel with the current one, so iteration i depends on iteration i-1 and the loop cannot be parallelized directly]

SLIDE 6

Amdahl’s Law

As the number of cores increases, even a small serial part can have a significant impact on overall performance

Future CMPs must improve the performance of both the parallel and serial parts
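Amdahl's law makes this concrete. If a fraction p of the work parallelizes perfectly across N cores, overall speedup is bounded by the serial remainder:

    Speedup(N) = 1 / ((1 - p) + p/N)

Even with p = 0.97, sixteen cores give only about an 11x speedup, and no number of cores can push it past 1/(1 - p) ≈ 33x.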
SLIDE 7

Outline

  • Background
  • Speeding up serial part

– Asymmetric Chip Multiprocessor (ACMP)

  • Speeding up parallel part

– Feedback-Driven Threading (FDT)

  • Summary
SLIDE 8

Current CMP Architectures

SLIDE 9

Current CMP Architectures

[Diagram: 16 Niagara-like small cores tiled on one chip]

“Niagara” Approach

  • Tile many small cores
  • Sun Niagara Processor
  • High throughput on the parallel part
  • Low performance on the serial part
SLIDE 10

Current CMP Architectures

[Diagram: 16 Niagara-like small cores, repeated from the previous slide]

“Niagara” Approach

SLIDE 11

Current CMP Architectures

[Diagram: 16 Niagara-like small cores (“Niagara” approach) alongside 4 large cores (“Tile-Large” approach)]

“Tile-Large” Approach

  • Tile a few large cores
  • IBM Power 5, AMD Barcelona, Intel Core2Quad
  • High performance on the serial part
  • Low throughput on the parallel part

SLIDE 12

Current CMP Architectures

[Diagram: “Niagara” approach (16 small cores) vs. “Tile-Large” approach (4 large cores)]

SLIDE 13

The Asymmetric Chip Multiprocessor (ACMP)

[Diagram: “Niagara” approach (16 small cores), “Tile-Large” approach (4 large cores), and the ACMP approach (1 large core plus 12 small cores)]

ACMP Approach

  • Provide one large core and many small cores
  • Accelerate serial part using the large core
  • Execute parallel part on small cores for high throughput

SLIDE 14

The Asymmetric Chip Multiprocessor (ACMP)

[Diagram: “Niagara”, “Tile-Large”, and ACMP approaches, repeated from the previous slide]

SLIDE 15

The Asymmetric Chip Multiprocessor (ACMP)

[Diagram: “Niagara”, “Tile-Large”, and ACMP approaches]

  • Analytical experiment details

– One large core replaces four small cores
– Large core provides 2x performance


SLIDE 16

Performance vs. Parallelism

[Chart: Speedup vs. 1 Large Core as a function of Degree of Parallelism for Niagara, Tile-Large, and ACMP]

  • Both ACMP and Tile-Large outperform Niagara
  • At high parallelism, Niagara outperforms ACMP
  • At medium parallelism, ACMP wins; Niagara beats ACMP at 97% parallelism
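A back-of-the-envelope model behind this chart (my reconstruction from the stated assumptions: an area budget of 16 small cores, one large core taking the area of four small cores and running 2x faster, and the ACMP running its parallel part only on its 12 small cores), with parallel fraction p and time normalized to one small core:

    T_Niagara(p)   = (1 - p) + p/16
    T_TileLarge(p) = (1 - p)/2 + p/8
    T_ACMP(p)      = (1 - p)/2 + p/12

Speedup over one large core is then ((1 - p)/2 + p/2) / T(p) = 0.5 / T(p). Equating T_Niagara and T_ACMP puts the crossover near p ≈ 0.96, consistent with the roughly 97% parallelism quoted above.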

SLIDE 17

Throughput of ACMP vs. Niagara

[Diagram: ACMP (one large core plus small cores) next to Niagara (all small cores)]

SLIDE 18

ACMP Scheduling

[Diagram: ACMP approach (1 large core plus 12 small cores)]

SLIDE 19

Data Transfers in ACMP

  • Data is transferred if the serial part requires data generated by the parallel part, or vice versa

  • ACMP

– Data is transferred from all small cores

  • Niagara/Tile-Large

– Data is transferred from all but one core

  • Number of data transfers increases by only 3.8%
SLIDE 20

Experimental Methodology

  • Configurations:

– Niagara: 16 small cores
– Tile-Large: 4 large cores
– ACMP: 1 large core, 12 small cores

  • Simulated existing multithreaded applications without modification
  • Simulation parameters:

– x86 cycle-accurate processor simulator
– Large core: 2 GHz, out-of-order, 128-entry window, 4-wide issue, 12-stage pipeline
– Small core: 2 GHz, in-order, 2-wide, 5-stage pipeline
– Private 32 KB L1, private 256 KB L2
– On-chip interconnect: bi-directional ring

SLIDE 21

Performance Results

[Chart: Speedup over Niagara for Tile-Large and ACMP across benchmarks, grouped into Low, Medium, and High Parallelism]

SLIDE 22

Impact of ACMP on Programmer Effort

  • ACMP makes performance less dependent on the length of the serial part
  • Programmers parallelize the easy-to-parallelize kernels
  • Hardware accelerates the difficult-to-parallelize serial part
  • Higher performance can be achieved with less effort
SLIDE 23

Outline

  • Background
  • Speeding up serial part

– Asymmetric Chip Multiprocessor (ACMP)

  • Speeding up parallel part

– Feedback-Driven Threading (FDT)

  • Summary
SLIDE 24

How Many Threads?

  • Some applications:

– As many threads as the number of cores

  • Other applications:

– Performance saturates
– Fewer threads than cores

The number of threads must be chosen carefully

SLIDE 25

Two Important Limitations

  • Contention for shared data

– Data synchronization: Critical section

  • Contention for shared resources

– Off-chip bus

SLIDE 26

Contention for Critical Section

Kernel from PageMine

[Figure: a page of text and the histogram of character occurrences (a, b, c, d, …) computed from it]

SLIDE 27

Contention for Critical Section

Kernel from PageMine:

    GetPageHistogram(Page *P)
        UpdateLocalHistogram(fraction of page)                       // parallel part
        Critical Section: add local histogram to global histogram    // serial part
        Barrier

[Figure: the page of text and its character-occurrence histogram, as on the previous slide]
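A minimal C sketch of this pattern (hypothetical names, not the PageMine source): each thread counts characters in its own private histogram, then a mutex-protected critical section merges it into the global histogram. As the thread count N grows, the merge step serializes, so time spent inside the critical section grows with N.

    #include <pthread.h>
    #include <stddef.h>

    #define NUM_BINS 256

    static unsigned global_hist[NUM_BINS];
    static pthread_mutex_t hist_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Called by each thread on its fraction of the page. */
    void count_chars(const unsigned char *chunk, size_t len)
    {
        unsigned local_hist[NUM_BINS] = {0};

        for (size_t i = 0; i < len; i++)        /* parallel part: private data */
            local_hist[chunk[i]]++;

        pthread_mutex_lock(&hist_lock);         /* critical section: shared data */
        for (int b = 0; b < NUM_BINS; b++)
            global_hist[b] += local_hist[b];
        pthread_mutex_unlock(&hist_lock);
    }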

SLIDE 28

Contention for Critical Section

[Chart: execution time for N = 1, 2, 4, 8 threads, broken into time outside the critical section (CS) and time inside the CS]

SLIDE 29

Two Important Limitations

  • Contention for shared data

– Data-synchronization: Critical section

  • Contention for shared resources

– Off-chip bus

SLIDE 30

Off-Chip Bandwidth

[Diagram: the chip connects to Main Memory over the Off-Chip Bus]

SLIDE 31

Contention for Off-chip Bus

Kernel from ED:

    EuclideanDistance(Point A)
        for i = 1 to num_dimensions
            sum = sum + A[i] * A[i]

SLIDE 32

Contention for Off-chip Bus

[Chart: execution time for N = 1, 2, 4, 8 threads; N = 4 and N = 8 take the same time to execute]

SLIDE 33

Who Chooses Number of Threads?

  • Programmer

– No! Not for general-purpose workloads
– Large variation in input data and machines

  • User

– No! I do not want Windows Media Player to ask me for the number of threads

  • Set equal to the number of cores

– Assumption: more threads → more performance

Goal: A run-time mechanism to estimate the best number of threads

SLIDE 34

Outline

  • Background
  • Speeding up serial part

– Asymmetric Chip Multiprocessor (ACMP)

  • Speeding up parallel part

– Feedback-Driven Threading (FDT)

  • Synchronization-Aware Threading (SAT)
  • Bandwidth-Aware Threading (BAT)
  • Combining SAT and BAT (SAT+BAT)
  • Summary
SLIDE 35

Feedback-Driven Threading (FDT)

[Diagram: Conventional Multithreading sets N = K; Feedback-Driven Threading chooses N from run-time feedback (N = number of threads, K = number of cores)]

SLIDE 36

Outline

  • Background
  • Speeding up serial part

– Asymmetric Chip Multiprocessor (ACMP)

  • Speeding up parallel part

– Feedback-Driven Threading (FDT)

  • Synchronization-Aware Threading (SAT)
  • Bandwidth-Aware Threading (BAT)
  • Combining SAT and BAT (SAT+BAT)
  • Summary
SLIDE 37

Synchronization-Aware Threading (SAT)

[Chart: execution time for N = 1, 2, 4, 8 threads, split into time outside and time inside the critical section]

With N threads, the work outside the critical section is divided N ways while the critical sections execute serially, so

    TN = (Time outside C.S.) / N + N × (Time inside C.S.)

which is minimized at

    NCS = sqrt(Time outside C.S. / Time inside C.S.)
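The minimum follows from setting the derivative to zero (a reconstruction of the optimization implied by the slide): dTN/dN = -(Time outside C.S.)/N² + (Time inside C.S.) = 0, which gives NCS = sqrt(Time outside C.S. / Time inside C.S.), the compute step used on the next slide.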

SLIDE 38

Implementing SAT using FDT

  • Train

– Measure the time inside and outside the critical section using cycle counter

  • Compute NCS = sqrt(Time outside C.S. / Time inside C.S.)
  • Execute
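A sketch in C of the compute step (a hypothetical helper, not the thesis runtime; training is assumed to have measured the two cycle counts during the first few iterations):

    #include <math.h>
    #include <stdint.h>

    /* SAT: pick the thread count that minimizes
     *   T_N = cycles_outside_cs / N + N * cycles_inside_cs,
     * i.e. N_CS = sqrt(cycles_outside_cs / cycles_inside_cs),
     * rounded and capped at the number of cores. */
    int sat_choose_threads(uint64_t cycles_outside_cs,
                           uint64_t cycles_inside_cs,
                           int num_cores)
    {
        double n_cs = sqrt((double)cycles_outside_cs / (double)cycles_inside_cs);
        int n = (int)(n_cs + 0.5);
        if (n < 1) n = 1;
        if (n > num_cores) n = num_cores;
        return n;
    }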

SLIDE 39

Machine Configuration

  • CMP: 32 in-order cores (2-wide, 5-stage deep)
  • Caches: L1: 8 KB, L2: 64 KB, shared L3: 8 MB
  • Off-chip bus: 64-bit wide, 4x slower than cores
  • Memory: 200 cycle minimum latency
SLIDE 40

Results of SAT

[Charts: normalized execution time for PageMine (Data Mining), ISort (NAS), EP (NAS), and GSearch (OSR)]

SAT decreases execution time and saves power

SLIDE 41

Adaptation of SAT to Input Data

  • Time inside and outside the critical section depends on the input to the program
  • For PageMine, the best number of threads changes with the page size

SLIDE 42

Outline

  • Background
  • Speeding up serial part

– Asymmetric Chip Multiprocessor (ACMP)

  • Speeding up parallel part

– Feedback-Driven Threading (FDT)

  • Synchronization-Aware Threading (SAT)
  • Bandwidth-Aware Threading (BAT)
  • Combining SAT and BAT (SAT+BAT)
  • Summary
SLIDE 43

Bandwidth-Aware Threading (BAT)

[Chart: execution time for N = 1, 2, 4, 8 threads; N = 4 and N = 8 take the same time to execute]

    NBW = Total bandwidth / Bandwidth used by a single thread

SLIDE 44

Implementing BAT using FDT

  • Train

– Measure bandwidth utilization using performance counters

  • Compute NBW = Total bandwidth / Bandwidth used by a single thread
  • Execute
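The corresponding compute step in C (again a hypothetical helper; the bandwidth numbers come from performance counters during training):

    /* BAT: cap the thread count so total bandwidth demand
     * stays within the available off-chip bus bandwidth. */
    int bat_choose_threads(double bw_one_thread,
                           double bw_total,
                           int num_cores)
    {
        int n = (int)(bw_total / bw_one_thread);
        if (n < 1) n = 1;
        if (n > num_cores) n = num_cores;
        return n;
    }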

SLIDE 45

Results of BAT

[Charts: normalized execution time for ED (Numerical), convert (ImageMagick), MTwister (nVIDIA), and Transpose (nVIDIA)]

BAT saves power without increasing execution time

SLIDE 46

Adaptation of BAT to System Configuration

  • The best number of threads is a function of off-chip bandwidth
  • BAT correctly predicts the best number of threads for systems with different bandwidth

[Chart: convert (ImageMagick) on systems with different off-chip bandwidth]

SLIDE 47

Outline

  • Background
  • Speeding up serial part

– Asymmetric Chip Multiprocessor (ACMP)

  • Speeding up parallel part

– Feedback-Driven Threading (FDT)

  • Synchronization-Aware Threading (SAT)
  • Bandwidth-Aware Threading (BAT)
  • Combining SAT and BAT (SAT+BAT)
  • Summary
SLIDE 48

Combining SAT and BAT

  • Train

– Train for both SAT and BAT

  • Compute

NSAT+BAT = MIN (NCS, NBW, Num. cores)

  • Execute
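Combining the two limits in C is just taking the most restrictive one (reusing the two hypothetical helpers sketched earlier):

    /* SAT+BAT: the thread count is the minimum of the two limits and the core count. */
    int sat_bat_choose_threads(int n_cs, int n_bw, int num_cores)
    {
        int n = (n_cs < n_bw) ? n_cs : n_bw;
        return (n < num_cores) ? n : num_cores;
    }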
SLIDE 49

Results of SAT+BAT

  • Fewer threads → fewer cache misses, so SAT+BAT reduces both power and execution time
  • On average, SAT+BAT reduces execution time by 17% and power by 59%

[Chart: execution time and power, normalized to 32 threads]
SLIDE 50

Comparison with Static-Best

Simulate all possible numbers of threads and choose the best

Two kernels: First needs 12 threads, second needs 32. Static-Best uses 32 for both.

[Chart: execution time and power of SAT+BAT vs. Static-Best, normalized to 32 threads]

SLIDE 51

Outline

  • Background
  • Speeding up serial part

– Asymmetric Chip Multiprocessor (ACMP)

  • Speeding up parallel part

– Feedback-Driven Threading (FDT)

  • Synchronization-Aware Threading (SAT)
  • Bandwidth-Aware Threading (BAT)
  • Combining SAT and BAT (SAT+BAT)
  • Summary
SLIDE 52

Summary

  • CMPs have increased the importance of multithreading
  • Performance of both serial and parallel parts is important
  • Asymmetric Chip Multiprocessor (ACMP)

– Accelerates the serial portion using a high-performance core
– Provides high throughput on the parallel portion using multiple small cores

  • Feedback-Driven Threading (FDT)

– Estimates best number of threads at run-time
– Adapts to input sets and machine configurations
– Does not require programmer/user intervention

SLIDE 53

  • Thank You