SLIDE 1

Asymmetry-Aware Work-Stealing Runtimes

Christopher Torng, Moyang Wang, and Christopher Batten

School of Electrical and Computer Engineering, Cornell University

43rd Int’l Symp. on Computer Architecture (ISCA), June 2016

SLIDE 2
  • Motivation •

First-Order Modeling Asymmetry-Aware Work-Stealing Runtimes Evaluation

Work-Stealing Runtimes

Static Asymmetry: Single-ISA Heterogeneous Architectures

Dynamic Asymmetry: Dynamic Voltage and Frequency Scaling

How can we use asymmetry awareness to improve the performance and energy efficiency of a work-stealing runtime?

Cornell University Christopher Torng 2 / 21

SLIDE 3

Work-Stealing Runtimes

[Figure: work in progress and task queues for Core 0 to Core 3. Tasks shown: A, B. Action: Spawn Task B.]

SLIDE 4

Work-Stealing Runtimes

[Figure: work in progress and task queues for Core 0 to Core 3. Tasks shown: B. Action: Dequeue Task B.]

SLIDE 5

Work-Stealing Runtimes

[Figure: work in progress and task queues for Core 0 to Core 3. Tasks shown: B, C. Action: Spawn Task C.]

SLIDE 6

Work-Stealing Runtimes

[Figure: work in progress and task queues for Core 0 to Core 3. Tasks shown: B, C, D. Action: Spawn Task D.]

SLIDE 7

Work-Stealing Runtimes

[Figure: work in progress and task queues for Core 0 to Core 3. Tasks shown: B, C, D. Actions: Steal Task C, Steal Task D.]

SLIDE 8

Work-Stealing Runtimes

[Figure: work in progress and task queues for Core 0 to Core 3. Tasks shown: C, D, E, F. Actions: Spawn Task E, Spawn Task F.]

SLIDE 9

Work-Stealing Runtimes

[Figure: work in progress and task queues for Core 0 to Core 3. Tasks shown: C, D, E, F. Actions: Steal Task E, Steal Task F.]

◮ Work stealing has good performance, space requirements, and communication overheads in both theory and practice

◮ Supported in many popular concurrency platforms, including Intel’s Cilk Plus, Intel’s C++ TBB, Microsoft’s .NET Task Parallel Library, Java’s Fork/Join Framework, and OpenMP
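The spawn, dequeue, and steal steps animated on the preceding slides can be sketched in a few lines of Python. This is a minimal single-threaded illustration, not the runtime used in the talk: the `Worker` class and `next_task` helper are hypothetical names, random victim choice is an assumption, and so is the common convention that owners pop the newest task while thieves take the oldest. A real runtime would also need a concurrent deque.

```python
import random
from collections import deque


class Worker:
    """One core's view of the runtime: a private double-ended task queue."""

    def __init__(self, core_id):
        self.core_id = core_id
        self.queue = deque()

    def spawn(self, task):
        # A running task pushes the child it spawns onto the tail of its own queue.
        self.queue.append(task)

    def dequeue(self):
        # Assumption: the owner takes the most recently spawned task (LIFO).
        return self.queue.pop() if self.queue else None

    def steal_from(self, victim):
        # Assumption: a thief takes the oldest task from the head of the victim's queue.
        return victim.queue.popleft() if victim.queue else None


def next_task(worker, workers, rng=random):
    """Return the worker's next task: local work first, otherwise try one steal."""
    task = worker.dequeue()
    if task is not None:
        return task
    victims = [w for w in workers if w is not worker]
    # Assumption: the victim is chosen uniformly at random.
    return worker.steal_from(rng.choice(victims))
```

For example, if core 0 spawns tasks B, C, and D, its own `dequeue()` returns D, while an idle core calling `steal_from(core0)` receives B.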

SLIDE 10

Static Asymmetry vs. Dynamic Asymmetry

Static asymmetry: Samsung Exynos Octa Mobile Processor

[Figure: die layout with big ARM cores (four A15) and little ARM cores (four A7), with two L2$ blocks]

Dynamic asymmetry: Integrated Voltage Regulation

[Figure: test chip with four integrated voltage regulators (labels: Cell, Control, Power Mux, Load) and a plot of voltage (V, 0.7 to 1.4) versus time (ns, 100 to 400) annotated with 120 ns and 150 ns]

From W. Godycki, C. Torng, I. Bukreyev, A. Apsel, C. Batten. “Enabling Realistic Fine-Grain Voltage Scaling with Reconfigurable Power Distribution Networks.” MICRO, 2014.

SLIDE 11

Work-Stealing Runtimes

Static Asymmetry: Single-ISA Heterogeneous Architectures

Dynamic Asymmetry: Dynamic Voltage and Frequency Scaling

◮ Bender et al. "Online Scheduling of Parallel Programs on Heterogeneous Sys ..." SPAA 2002

◮ Ribic et al. "Energy-Efficient Work-Stealing Language Runtimes" ASPLOS 2014

◮ Azizi et al. "Energy-performance Tradeoffs in Processor Architecture and Circuit Design: A Marginal Cost Analysis" ISCA 2010

How can we use asymmetry awareness to improve the performance and energy efficiency of a work-stealing runtime?

SLIDE 12

Talk Outline

Motivation

  • First-Order Modeling •

Asymmetry-Aware Work-Stealing Runtimes

Evaluation

[Figure: a system with two little (L) and two big (B) cores. Work-Stealing Runtimes, Dynamic Asymmetry, and Static Asymmetry are combined through Work-Pacing, Work-Sprinting, and Work-Mugging.]

SLIDE 13

Building Intuition by Exploring a 1 Big 1 Little System

[Figure: normalized power (1 to 8) versus normalized IPS (0.5 to 3.0). The little core (L) sits at (IPS, power) = (1.0, 1.0) and the four-way big core (B) at (2.0, 6.0). The system with 1 big and 1 little core totals 3.0 IPS at 7.0 power.]

SLIDE 14

Building Intuition by Exploring a 1 Big 1 Little System

[Figure: same plot, with additional B and L operating points]

SLIDE 15

Building Intuition by Exploring a 1 Big 1 Little System

[Figure: same plot, with additional B and L operating points]

Same Power, 10% Performance Increase

SLIDE 16

The Law of Equi-Marginal Utility

Alfred Marshall (1842 - 1924), British Economist

"Other things being equal, a consumer gets maximum satisfaction when he allocates his limited income to the purchase of different goods in such a way that the Marginal Utility derived from the last unit of money spent on each item of expenditure tend to be equal."

Balance the ratio of utility (IPS) to cost (power)

[Figure: normalized power (1 to 8) versus normalized IPS (0.5 to 3.0). The slope (dy, cost over dx, utility) is marked on each core's curve at 1.0 V and 1.0 V.]

SLIDE 17

The Law of Equi-Marginal Utility

[Figure: same plot and quotation, with the slopes marked at 0.9 V and 1.3 V]

Arbitrage: "Buy Low, Sell High"
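One way to write the balance condition down, under assumptions that are ours rather than the slides': each core's throughput is a smooth, concave function of the power it is given, and total power is held at a fixed budget. The symbols IPS_B, IPS_L, P_B, P_L, and P_target are introduced here for illustration only.

```latex
\begin{aligned}
&\max_{P_B,\,P_L}\; \mathrm{IPS}_B(P_B) + \mathrm{IPS}_L(P_L)
\quad \text{subject to} \quad P_B + P_L = P_{\text{target}} \\[4pt]
&\Longrightarrow\quad
\frac{d\,\mathrm{IPS}_B}{d P_B} \;=\; \frac{d\,\mathrm{IPS}_L}{d P_L}
\end{aligned}
```

When the two derivatives differ, shifting a small amount of power from the core with the lower marginal IPS per watt to the core with the higher one raises total IPS at the same total power, which is the arbitrage the slide refers to.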

SLIDE 18

Systematic Approach for Balancing Marginal Utility

Assumptions

◮ Perfectly parallel application

◮ Ideal load balancing

[Figure: normalized energy efficiency versus normalized IPS for a 1 big 1 little system. Each point is an individual (VB, VL) pair. The plot marks the system at nominal voltage, the isopower line, and the Pareto-optimal frontier.]

◮ Performance at expense of energy efficiency

◮ Energy efficiency at expense of performance

SLIDE 19

Systematic Approach for Balancing Marginal Utility

[Figure: same plot and assumptions]

◮ Improve both performance and energy efficiency

SLIDE 20

Systematic Approach for Balancing Marginal Utility

[Figure: same plot and assumptions]

Marginal Utility-Based Optimization Problem

◮ Constraint: isopower line

◮ Objective: maximize performance

◮ Solved numerically
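The optimization problem stated above can be sketched numerically for the 1 big 1 little system. This is a toy model and not the authors': it assumes IPS scales linearly with voltage, that power is purely dynamic and scales with the cube of voltage, and that both cores can run between 0.70 V and 1.30 V. Only the nominal operating points (big core at 2.0 IPS and 6.0 power, little core at 1.0 and 1.0) come from the slides, so the result is not expected to reproduce the 10% figure reported earlier.

```python
NOMINAL_V = 1.0
V_MIN, V_MAX = 0.70, 1.30  # assumed voltage range

# (IPS, power) at nominal voltage, normalized to the little core (from the slides).
BIG = (2.0, 6.0)
LITTLE = (1.0, 1.0)


def ips(core, v):
    # Assumption: frequency, and hence IPS, scales linearly with voltage.
    return core[0] * (v / NOMINAL_V)


def power(core, v):
    # Assumption: dynamic power only, P ~ f * V^2 ~ V^3; leakage is ignored.
    return core[1] * (v / NOMINAL_V) ** 3


def best_isopower_pair(steps=601):
    """Grid-search V_L, solve V_B from the isopower constraint, keep the max IPS."""
    budget = power(BIG, NOMINAL_V) + power(LITTLE, NOMINAL_V)
    best = None
    for i in range(steps):
        v_l = V_MIN + (V_MAX - V_MIN) * i / (steps - 1)
        remaining = budget - power(LITTLE, v_l)
        v_b = NOMINAL_V * (remaining / BIG[1]) ** (1.0 / 3.0)
        if not (V_MIN <= v_b <= V_MAX):
            continue  # the big core cannot absorb the remaining power
        total = ips(BIG, v_b) + ips(LITTLE, v_l)
        if best is None or total > best[0]:
            best = (total, v_b, v_l)
    return best  # (total IPS, V_B, V_L)
```

Under these assumptions the search settles on a big-core voltage below nominal and a little-core voltage at the upper limit, the same direction as the 0.9 V and 1.3 V pair shown on the equi-marginal utility slide.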

SLIDE 21

Talk Outline

Motivation

First-Order Modeling

  • Asymmetry-Aware Work-Stealing Runtimes •

Evaluation

[Figure: a system with two little (L) and two big (B) cores. Work-Stealing Runtimes, Dynamic Asymmetry, and Static Asymmetry are combined through Work-Pacing, Work-Sprinting, and Work-Mugging.]

SLIDE 22

Work-Pacing: Building Intuition

Balance performance/power across cores in the high-parallel (HP) region

[Figure: activity bars (busy vs. steal loop) for a 2 big, 2 little system (L L B B) with both big cores active and both little cores active]

[Figure: normalized power (1 to 7) versus normalized IPS (0.0 to 3.0) for the big (B) and little (L) cores]

[Figure: marginal IPS/W for B and L, and aggregate system IPS (aggregate throughput, 0.0 to 1.2). The x-axis ticks list VL (0.70, 1.00, 1.30, 1.60, 1.90) alongside VB (1.04, 1.00, 0.92, 0.76, 0.24).]

SLIDE 23

Work-Pacing, Work-Sprinting, and Work-Mugging

[Figure: activity bars (busy vs. steal loop) for cores L L B B]

Work-Pacing

◮ Balance performance/power across cores in the high-parallel (HP) region

Work-Sprinting

◮ Rest cores in the steal loop to the lowest voltage

◮ With additional power slack, balance performance/power across busy cores in the low-parallel (LP) region

Work-Mugging

◮ Move work from slow little cores to fast big cores in the low-parallel (LP) region

SLIDE 24

Work-Pacing and Work-Sprinting Mechanisms

[Figure: two big (B) and two little (L) cores, each with an L1$, connected by an on-chip interconnect to shared L2$ cache banks and a DRAM memory controller, with voltage regulators and a DVFS controller. The runtime (work in progress and task queues for Big, Big, Little, Little, with Task A, Task B, and Task C) passes hints to the DVFS controller: which cores are stealing, big or little? The controller's decision sets Vdd.]

Activity Pattern and Voltages (A = Active, s = Stealing; cores ordered B L B L)

◮ A A A A: VB = 0.91 V, VL = 1.30 V

◮ A A A s: 0.98 V, 1.30 V, 0.70 V

◮ Other patterns shown: A A s s, A s A A, A s A s, A s s s, s s A A, s s A s, s s s s

◮ Other voltages shown: 1.03 V, 1.30 V, 1.04 V, 1.30 V, 1.13 V, 1.30 V, 1.21 V, 0.70 V, 0.70 V, 1.30 V, 0.70 V, 1.30 V, 0.70 V, 0.70 V, 0.70 V, 0.70 V, 0.70 V, 0.70 V, 0.70 V, 0.70 V, 0.70 V
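The hint-to-voltage lookup on this slide can be sketched as a small table-driven function. This is an illustration under stated assumptions, not the controller from the talk: `decide_voltages`, `ACTIVE_VOLTAGES`, and the nominal-voltage fallback are hypothetical, and only the all-active entry (VB = 0.91 V, VL = 1.30 V) and the 0.70 V resting voltage are taken from the slide.

```python
V_REST = 0.70      # lowest voltage; stealing cores rest here
V_NOMINAL = 1.00   # fallback when a pattern has no table entry (assumption)

# Keyed by (number of active big cores, number of active little cores).
# Only the all-active entry is taken from the slide; the rest of the table
# is deliberately left empty rather than guessed.
ACTIVE_VOLTAGES = {
    (2, 2): {"big": 0.91, "little": 1.30},
}


def decide_voltages(cores):
    """Map activity hints to per-core voltages.

    cores: list of (kind, state) tuples, with kind in {"big", "little"}
           and state in {"A", "s"} for active or stealing.
    """
    n_big = sum(1 for kind, state in cores if kind == "big" and state == "A")
    n_little = sum(1 for kind, state in cores if kind == "little" and state == "A")
    entry = ACTIVE_VOLTAGES.get((n_big, n_little), {})
    return [
        V_REST if state == "s" else entry.get(kind, V_NOMINAL)
        for kind, state in cores
    ]
```

Keying the table on counts of active big and little cores keeps it small; a table keyed on the full four-core pattern would work the same way.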

SLIDE 25

Work-Mugging Mechanisms

[Figure: two big (B) and two little (L) cores with L1$ caches, on-chip interconnect, shared L2$ cache banks, DRAM memory controller, voltage regulators, DVFS controller, and a user-level interrupt network. Labeled steps: Initiate Mug, Mug Interrupt, Thread Context Swap.]

Mug Instruction

◮ Thread ID to mug

◮ Address of thread-swapping handler

User-Level Interrupt Network

◮ Simple, low-bandwidth inter-core network

◮ Latency on the order of 20 cycles

Thread Context Swap

◮ Threads store architectural state to separate locations in shared memory

◮ Both threads sync

◮ Threads load architectural state from the other location
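The three thread-context-swap steps listed above (store, sync, load from the other location) can be mimicked in software with two Python threads and a barrier. This is only an analogy for the hardware mechanism: `swap_contexts` is a hypothetical helper, a dictionary stands in for shared memory, and a plain dict of register values stands in for architectural state.

```python
import threading


def swap_contexts(context_a, context_b):
    """Swap two threads' state through shared memory; returns (new_a, new_b)."""
    shared = {}                      # stands in for shared memory
    barrier = threading.Barrier(2)   # the "both threads sync" step
    result = {}

    def participant(me, other, context):
        shared[me] = dict(context)   # 1. store own state to a separate location
        barrier.wait()               # 2. wait until both stores have happened
        result[me] = shared[other]   # 3. load state from the other location

    thread_a = threading.Thread(target=participant, args=("a", "b", context_a))
    thread_b = threading.Thread(target=participant, args=("b", "a", context_b))
    thread_a.start()
    thread_b.start()
    thread_a.join()
    thread_b.join()
    return result["a"], result["b"]
```

The barrier matters: without it, one thread could try to load the other's state before it has been stored.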

SLIDE 26

Talk Outline

Motivation

First-Order Modeling

Asymmetry-Aware Work-Stealing Runtimes

  • Evaluation •

[Figure: a system with two little (L) and two big (B) cores. Work-Stealing Runtimes, Dynamic Asymmetry, and Static Asymmetry are combined through Work-Pacing, Work-Sprinting, and Work-Mugging.]

SLIDE 27

Evaluation Methodology: Modeling

Work-Stealing Runtime

◮ State-of-the-art Intel TBB-inspired work-stealing scheduler

◮ Chase-Lev task queues with occupancy-based victim selection

◮ Instrumented with activity hints
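A minimal sketch of what occupancy-based victim selection could look like, assuming it means that a thief targets the core whose task queue currently holds the most tasks. The slides do not spell out the policy, so `pick_victim` and its tie-breaking rule are illustrative only.

```python
def pick_victim(thief_id, queue_lengths):
    """Return the id of the core with the fullest queue, or None if no work exists.

    queue_lengths: list where queue_lengths[i] is core i's current occupancy.
    """
    candidates = [
        (length, core_id)
        for core_id, length in enumerate(queue_lengths)
        if core_id != thief_id and length > 0
    ]
    if not candidates:
        return None
    # max() compares occupancy first; ties go to the higher core id.
    return max(candidates)[1]
```

Compared with picking a victim at random, this avoids failed steal attempts against empty queues, at the cost of reading every queue's occupancy.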

Cycle-Level Modeling

◮ Heterogeneous system modeled in gem5 cycle-approximate simulator

◮ Support for scaling per-core frequencies + central DVFS Controller

Energy Modeling

◮ Event-based energy modeling based on detailed RTL/gate-level sims (Synopsys ASIC toolflow, TSMC LP, 65 nm, 1.0 V)

◮ Carefully selected subset of McPAT results tuned to our µarchitecture

SLIDE 28

Work-Pacing in cilk-sort

[Figure: activity bars (busy vs. steal loop) and DVFS controller decisions over time for the big and little cores, with no AAWS techniques and with work-pacing. Voltage legend: 0.70 V, 0.90 V, 1.00 V, 1.04 V, 1.24 V, 1.30 V. Annotation: speedup of 1.11x.]

[Figure: normalized energy efficiency (0.9 to 1.5) versus performance (1.0 to 1.5), with the isopower line]

Speedup 1.11x, Energy Efficiency 1.11x

SLIDE 29

Work-Sprinting in quicksort

[Figure: activity bars (busy vs. steal loop) and DVFS controller decisions over time for the big and little cores, with no AAWS techniques and with work-sprinting. Voltage legend: 0.70 V, 1.00 V, 1.04 V, 1.10 V, 1.24 V, 1.30 V. Annotation: speedup of 1.34x.]

[Figure: normalized energy efficiency (0.9 to 1.5) versus performance (1.0 to 1.5), with the isopower line]

Speedup 1.34x, Energy Efficiency 1.16x

SLIDE 30

Work-Mugging in radix sort

[Figure: activity bars (busy vs. steal loop) and DVFS controller decisions over time for the big and little cores, with no AAWS techniques and with work-mugging. Voltage legend: 0.70 V, 0.90 V, 1.00 V, 1.04 V, 1.24 V, 1.30 V. Annotation: speedup 1.17x.]

[Figure: normalized energy efficiency (0.9 to 1.5) versus performance (1.0 to 1.5), with the isopower line]

Speedup 1.17x, Energy Efficiency 1.40x
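The speedup and energy-efficiency pairs from the three case studies can be related with one line of arithmetic. The sketch below assumes that normalized energy efficiency means baseline energy divided by new energy for the same work, which the slides do not state explicitly; `relative_power` is a hypothetical helper.

```python
# (speedup, energy-efficiency gain) versus the baseline, from the three case studies.
RESULTS = {
    "work-pacing (cilk-sort)": (1.11, 1.11),
    "work-sprinting (quicksort)": (1.34, 1.16),
    "work-mugging (radix sort)": (1.17, 1.40),
}


def relative_power(speedup, efficiency_gain):
    # energy = average power * time, so the average power ratio is
    # (E / E0) / (t / t0) = (1 / efficiency_gain) * speedup.
    return speedup / efficiency_gain


for name, (speedup, efficiency_gain) in RESULTS.items():
    print(f"{name}: {relative_power(speedup, efficiency_gain):.2f}x average power")
```

Under that definition, equal speedup and efficiency gains (the work-pacing case) mean average power is unchanged, a speedup larger than the efficiency gain means average power went up, and the reverse means it went down.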

SLIDE 31

Evaluation of Complete AAWS Runtime

[Figure: normalized energy efficiency (0.9 to 1.5) versus performance (1.0 to 1.5), with the isopower line and one point per application kernel]

Application Kernels

pbbs-bfs, pbbs-quicksort, pbbs-samplesort, pbbs-dictionary, pbbs-convex-hull, pbbs-radix-sort, pbbs-knn, pbbs-max-independent-set, pbbs-nbody, pbbs-remove-duplicates, pbbs-suffix-array, pbbs-spanning-tree, cilk-cholesky, cilk-cilksort, cilk-heat, cilk-knapsack, cilk-matrix-multiply, parsec-blackscholes, unbalanced-tree-search

Performance: Median 1.10x, Max 1.32x

Energy Efficiency: Median 1.11x, Max 1.53x

SLIDE 32


Take-Away Point

Holistically combining

  • work-stealing runtimes
  • static asymmetry
  • dynamic asymmetry

through the use of

  • work-pacing
  • work-sprinting
  • work-mugging

can improve both performance and energy efficiency in future multicore systems

This work was partially supported by the NSF, AFOSR, and donations from Intel and Synopsys.
