High-Performance Execution of Multithreaded Workloads on CMPs
M. Aater Suleman
Advisor: Yale Patt
HPS Research Group, The University of Texas at Austin
How do we use the transistors?

Build a larger, higher-performance core:
– Performance increases without programmer effort
– Larger cores are complex and consume more power

Spend the transistors on caches:
– Assist the core by reducing memory accesses
– Easier to design and consume less power
– Pentium M: 50M of its 77M transistors were cache

Build many simpler cores – Chip Multiprocessors (CMPs):
– Less complex
– Run at lower frequency (Power ∝ frequency³)
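Because power grows with the cube of frequency, trading one fast core for several slower ones saves power. With illustrative numbers (not from the talk): one core at frequency f consumes P ∝ f³, while two cores at f/2 consume 2 × (f/2)³ = f³/4, i.e. a quarter of the power for (ideally) the same aggregate throughput.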
To leverage CMPs, applications must be split into threads; a single-threaded program can use only one core at a time.
Kernel from ImageMagick:

GrayscaleToMonochrome(picture)
  foreach OldPixel in picture
    if (OldPixel > Threshold)
      NewPixel = 1
    else
      NewPixel = 0

Each pixel depends only on its own old value, so the loop parallelizes trivially.
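A minimal C sketch of how such a kernel splits across threads (illustrative, not the talk's code; the OpenMP pragma and names like `threshold` are assumptions):

    #include <stddef.h>

    void grayscale_to_monochrome(const unsigned char *old_px,
                                 unsigned char *new_px,
                                 size_t n, unsigned char threshold)
    {
        /* Every iteration writes one pixel from one independent input
           pixel, so the work divides among threads with no
           synchronization. */
        #pragma omp parallel for
        for (size_t i = 0; i < n; i++)
            new_px[i] = (old_px[i] > threshold) ? 1 : 0;
    }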
Kernel from ImageMagick:

Smooth(picture)
  for i = 1 to N
    Pixel[i] = (Pixel[i-1] + Pixel[i]) / 2

Each new pixel averages the previous new pixel with the current old pixel, so iteration i cannot start until iteration i-1 has finished: the loop is inherently serial.
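A minimal C sketch (illustrative, not the talk's code) that makes the loop-carried dependence explicit; adding a parallel-for here would change the results:

    #include <stddef.h>

    void smooth(float *px, size_t n)
    {
        /* px[i] reads px[i-1] *after* it was updated by the previous
           iteration: a loop-carried dependence, so the iterations must
           execute in order. */
        for (size_t i = 1; i < n; i++)
            px[i] = (px[i-1] + px[i]) / 2.0f;
    }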
As the number of cores increases, even a small serial part can have a significant impact on overall performance.
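This is Amdahl's law (a standard result, not specific to this talk): with parallel fraction p and N cores, Speedup(N) = 1 / ((1-p) + p/N). Even with p = 99%, 64 cores yield a speedup of only 1 / (0.01 + 0.99/64) ≈ 39, not 64.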
Two solutions are proposed:
– Asymmetric Chip Multiprocessor (ACMP)
– Feedback-Driven Threading (FDT)
Three equal-area ways to spend the chip (one large core occupies the area of four small cores; a large core provides 2x the performance of a small core):

"Niagara" Approach: 16 small cores, built for high throughput on the parallel part.

"Tile-Large" Approach: 4 large cores.

ACMP Approach: 1 large core plus 12 small cores. The large core accelerates the serial part; the small cores provide throughput on the parallel part.
[Chart: Speedup vs. one large core as a function of the degree of parallelism, for Niagara, Tile-Large, and ACMP.]

– At low parallelism, both ACMP and Tile-Large outperform Niagara: the serial part dominates, and their large cores run it 2x faster.
– At medium parallelism, ACMP wins.
– At high parallelism, Niagara wins; it beats ACMP only above 97% parallelism.
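A back-of-the-envelope model of these curves, assuming (as on the earlier slide) that a large core gives 2x a small core's performance and that the serial part runs on the fastest core available (time normalized to one large core, parallel fraction p):

    Niagara:    time = (1-p)/0.5 + p/(16 × 0.5)
    Tile-Large: time = (1-p) + p/(4 × 1)
    ACMP:       time = (1-p) + p/(1 + 12 × 0.5)

ACMP matches Tile-Large on the serial part while offering 7 small-core-equivalents of parallel throughput; only at very high parallelism does Niagara's 8-core-equivalent throughput outweigh its 2x-slower serial execution. (The exact crossover depends on modeling details not shown here.)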
ACMP Approach
[Diagram: one large core plus 12 small cores on a single chip.]
The serial part typically consumes data generated by the parallel part, or vice-versa, so switching between phases moves data between caches:
– ACMP: data is transferred from all small cores to the large core
– Symmetric CMP: data is transferred from all but one core, since the thread that continues serial execution already holds its share
Experimental configurations (equal area):
– Niagara: 16 small cores
– Tile-Large: 4 large cores
– ACMP: 1 large core, 12 small cores
Benchmarks run without modification.
Simulator:
– x86 cycle-accurate processor simulator
– Large core: 2 GHz, out-of-order, 128-entry window, 4-wide issue, 12-stage pipeline
– Small core: 2 GHz, in-order, 2-wide, 5-stage pipeline
– Private 32 KB L1, private 256 KB L2
– On-chip interconnect: bi-directional ring
[Chart: Speedup over Niagara for Tile-Large and ACMP across benchmarks with low, medium, and high parallelism: mcf, is_nasp, fft_splash, cg_nasp, ep_nasp, art_…, mg_nasp, fmm_splash, cholesky, pagec…, convert, h.264, ed.]
– Asymmetric Chip Multiprocessor (ACMP)
– Feedback-Driven Threading (FDT)
Conventional practice spawns as many threads as the number of cores. In reality, performance often saturates with fewer threads than cores.
Performance saturates because threads contend for two resources:
– Data synchronization: critical sections
– The off-chip bus
Kernel from PageMine: compute a histogram of the characters on a page of text.
[Diagram: a page of text reduced to per-character counts (a, b, c, d) that are summed into a histogram.]
Kernel from PageMine:

GetPageHistogram(Page *P)
  UpdateLocalHistogram(fraction of page)    // parallel part: each thread processes its fraction
  Critical section: add local histogram to global histogram    // serial part
  Barrier
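A minimal C/pthreads sketch of this pattern (illustrative, not the talk's code; BINS, Chunk, and the worker signature are assumptions):

    #include <pthread.h>
    #include <stddef.h>

    #define BINS 256
    static int global_hist[BINS];
    static pthread_mutex_t hist_lock = PTHREAD_MUTEX_INITIALIZER;

    typedef struct { const unsigned char *text; size_t len; } Chunk;

    void *page_histogram_worker(void *arg)
    {
        Chunk *c = (Chunk *)arg;
        int local[BINS] = {0};

        /* Parallel part: each thread histograms its own fraction of
           the page with no synchronization. */
        for (size_t i = 0; i < c->len; i++)
            local[c->text[i]]++;

        /* Serial part: the merge is a critical section; the mutex lets
           only one thread update the global histogram at a time. */
        pthread_mutex_lock(&hist_lock);
        for (int b = 0; b < BINS; b++)
            global_hist[b] += local[b];
        pthread_mutex_unlock(&hist_lock);
        return NULL;
    }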
[Chart: execution time of the PageMine kernel for N = 1, 2, 4, 8 threads, split into time outside the critical section (CS) and time inside the CS.]
As N grows, time outside the critical section shrinks, but contended time inside it grows; beyond some N, total execution time increases.
Performance saturates because threads contend for:
– Data synchronization: critical sections (just covered)
– The off-chip bus (next)
Off-Chip Bus: all cores share the bus to off-chip memory, so memory bandwidth is a finite resource divided among threads.
Kernel from ED:

EuclideanDistance(Point A)
  for i = 1 to num_dimensions
    sum = sum + A[i] * A[i]
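A minimal C sketch (illustrative, not the talk's code; the final sqrt is implied by the kernel's name). The loop streams each coordinate from memory once and does almost no computation per byte, so a few concurrent copies of it saturate the off-chip bus:

    #include <stddef.h>
    #include <math.h>

    double euclidean_distance(const double *a, size_t num_dimensions)
    {
        double sum = 0.0;
        /* One multiply-add per 8-byte load and no data reuse: memory
           traffic, not compute, becomes the bottleneck as threads are
           added. */
        for (size_t i = 0; i < num_dimensions; i++)
            sum += a[i] * a[i];
        return sqrt(sum);
    }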
[Chart: execution time of the ED kernel for N = 1, 2, 4, 8 threads.]
N = 4 and N = 8 take the same time to execute: the off-chip bus is already saturated at 4 threads, so the extra threads only burn power.
Can the programmer pick the best number of threads?
– No! Not for general-purpose workloads: input data and machines vary too widely.
Can the user?
– No! I do not want Windows Media Player to ask me for the number of threads.
Current systems simply assume: more threads → more performance.
Goal: a run-time mechanism to estimate the best number of threads.
– Asymmetric Chip Multiprocessor (ACMP)
– Feedback-Driven Threading (FDT)
Conventional multithreading: N = K, where N = number of threads and K = number of cores.
Feedback-Driven Threading (FDT): choose N at run-time from the program's measured behavior.
– Asymmetric Chip Multiprocessor (ACMP)
– Feedback-Driven Threading (FDT)
Synchronization-Aware Threading (SAT). With N threads, the parallel work shrinks as 1/N while the serialized critical sections contend, so execution time is modeled as:

T_N = (Time outside C.S.) / N + N × (Time inside C.S.)

Setting dT_N/dN = 0 gives the thread count that minimizes T_N:

N_CS = sqrt( (Time outside C.S.) / (Time inside C.S.) )
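With illustrative numbers (not from the talk): if 90% of a single thread's cycles fall outside the critical section and 10% inside, then N_CS = sqrt(90/10) = 3; running more than 3 threads makes the serialized term N × (Time inside C.S.) dominate and total time rise.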
SAT at run-time:
– During a brief training phase, measure the time inside and outside the critical section using the cycle counter
– Compute N_CS = sqrt( (Time outside C.S.) / (Time inside C.S.) ) and spawn that many threads for the rest of the kernel
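A minimal C sketch of such a training phase (an assumption-laden illustration, not the talk's implementation: read_cycles() wraps the x86 rdtsc cycle counter, and the work_* functions stand in for one kernel iteration):

    #include <stdint.h>
    #include <math.h>

    /* Hypothetical cycle-counter read; on x86-64 this wraps rdtsc. */
    static inline uint64_t read_cycles(void)
    {
    #if defined(__x86_64__)
        uint32_t lo, hi;
        __asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    #else
        return 0; /* substitute the platform's cycle counter here */
    #endif
    }

    /* Illustrative stand-ins for one iteration of the kernel. */
    static void work_outside_cs(void) { /* parallelizable work */ }
    static void work_inside_cs(void)  { /* serialized update (lock held) */ }

    /* Training: time a few single-threaded iterations, then apply
       N_CS = sqrt(T_outside / T_inside), capped at the core count. */
    int sat_thread_count(int num_cores, int train_iters)
    {
        uint64_t outside = 0, inside = 0;
        for (int i = 0; i < train_iters; i++) {
            uint64_t t0 = read_cycles();
            work_outside_cs();
            uint64_t t1 = read_cycles();
            work_inside_cs();
            uint64_t t2 = read_cycles();
            outside += t1 - t0;
            inside  += t2 - t1;
        }
        if (inside == 0)
            return num_cores;   /* no measurable critical section */
        int n = (int)sqrt((double)outside / (double)inside);
        if (n < 1) n = 1;
        if (n > num_cores) n = num_cores;
        return n;
    }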
[Chart: SAT results on PageMine (data mining), ISort (NAS), EP (NAS), and GSearch (OSR).]
SAT decreases execution time and saves power.
SAT adapts to the input to the program: for example, as the page size changes, SAT adjusts the number of threads it spawns.
– Asymmetric Chip Multiprocessor (ACMP)
– Feedback-Driven Threading (FDT)
Bandwidth-Aware Threading (BAT). Recall that N = 4 and N = 8 took the same time to execute: once the off-chip bus saturates, additional threads add no performance. The bus saturates when the threads' combined demand equals the available bandwidth, so the best thread count is:

N_BW = (Total Bandwidth) / (Bandwidth used by a single thread)
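With illustrative numbers (not from the talk): on a machine with 12.8 GB/s of off-chip bandwidth, a kernel whose single thread consumes 3.2 GB/s gets N_BW = 12.8 / 3.2 = 4; a fifth thread would find no bandwidth left to use.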
BAT at run-time:
– During a brief training phase, measure the bandwidth used by a single thread with performance counters
– Compute N_BW = (Total Bandwidth) / (Bandwidth used by a single thread) and spawn that many threads
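A minimal C sketch of BAT training (again an assumption-laden illustration: read_bus_bytes() is a hypothetical hook for a memory-traffic performance counter, and the machine's peak bandwidth is taken as a known constant):

    #include <stdint.h>
    #include <time.h>

    /* Hypothetical: returns bytes transferred so far on the off-chip
       bus. On real hardware this would read an uncore/memory-controller
       counter, e.g. through the Linux perf_event interface. */
    static uint64_t read_bus_bytes(void) { return 0; /* placeholder */ }

    static void work_one_iteration(void) { /* one kernel iteration */ }

    int bat_thread_count(double peak_bw_bytes_per_sec,
                         int num_cores, int train_iters)
    {
        struct timespec ts0, ts1;
        uint64_t b0 = read_bus_bytes();
        clock_gettime(CLOCK_MONOTONIC, &ts0);
        for (int i = 0; i < train_iters; i++)
            work_one_iteration();          /* single-threaded training run */
        clock_gettime(CLOCK_MONOTONIC, &ts1);
        uint64_t b1 = read_bus_bytes();

        double secs = (double)(ts1.tv_sec - ts0.tv_sec)
                    + (double)(ts1.tv_nsec - ts0.tv_nsec) * 1e-9;
        double one_thread_bw = (double)(b1 - b0) / secs;
        if (one_thread_bw <= 0.0)
            return num_cores;              /* counter unavailable: no cap */

        int n = (int)(peak_bw_bytes_per_sec / one_thread_bw);
        if (n < 1) n = 1;
        if (n > num_cores) n = num_cores;
        return n;
    }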
[Chart: BAT results on ED (numerical), convert (ImageMagick), MTwister (NVIDIA), and Transpose (NVIDIA).]
BAT saves power without increasing execution time.
BAT adapts to systems with different off-chip bandwidth: for example, the thread count it picks for convert (ImageMagick) changes with the machine's bandwidth.
– Asymmetric Chip Multiprocessor (ACMP)
– Feedback-Driven Threading (FDT)
Combining SAT and BAT:
– Train for both SAT and BAT
– N_SAT+BAT = MIN(N_CS, N_BW, Num. cores)
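Putting the two together, a minimal sketch reusing the hypothetical sat_thread_count() and bat_thread_count() helpers from the earlier sketches (the peak-bandwidth constant is illustrative):

    /* Prototypes from the earlier sketches (hypothetical helpers). */
    int sat_thread_count(int num_cores, int train_iters);
    int bat_thread_count(double peak_bw_bytes_per_sec,
                         int num_cores, int train_iters);

    #define PEAK_BW_BYTES_PER_SEC 12.8e9   /* illustrative machine peak */

    /* Feedback-driven choice: train briefly for both limiters, then
       take the most restrictive of the two bounds and the core count. */
    int fdt_thread_count(int num_cores)
    {
        int n_cs = sat_thread_count(num_cores, 32);
        int n_bw = bat_thread_count(PEAK_BW_BYTES_PER_SEC, num_cores, 32);
        int n = n_cs < n_bw ? n_cs : n_bw;
        return n < num_cores ? n : num_cores;
    }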
Fewer threads also mean fewer cache misses, so (SAT+BAT) reduces both power and execution time. On average, (SAT+BAT) reduces execution time by 17% and power by 59%.
Comparison with Static-Best: simulate every possible number of threads offline and statically choose the best single value.
A static choice cannot adapt within a program. Example: of two kernels, the first needs 12 threads and the second needs 32; Static-Best must use 32 for both.
[Chart: (SAT+BAT) vs. Static-Best execution time and power.]
– Asymmetric Chip Multiprocessor (ACMP)
– Feedback-Driven Threading (FDT)
Conclusions

ACMP:
– Accelerates the serial portion using a high-performance core
– Provides high throughput on the parallel portion using multiple small cores

FDT:
– Estimates the best number of threads at run-time
– Adapts to input sets and machine configurations
– Requires no programmer or user intervention