The Computational Sprinting Game - Songchun Fan, Seyed Majid Zahedi, Benjamin C. Lee

SLIDE 1

The Computational Sprinting Game

Songchun Fan∗, Seyed Majid Zahedi∗, Benjamin C. Lee

{songchun.fan, seyedmajid.zahedi, benjamin.c.lee}@duke.edu

[∗Co-First Authors]

SLIDE 2

Computational Sprinting

  • Supply extra power to enhance performance for short durations
  • Activate more cores, boost voltage/frequency

SLIDE 3

Computational Sprinting

  • Supply extra power to enhance performance for short durations
  • Activate more cores, boost voltage/frequency

[Figure: three bar charts over the benchmarks naive, decision, gradient, svm, linear, kmeans, als, correlation, pagerank, cc, triangle. Panels: Normalized Speedup (1 to 6), Normalized Power (0.0 to 1.5), Average Temperature (°C, 10 to 50). Legend: Non-sprinting, Sprinting.]

SLIDE 4

Sprinting Architecture

  • Power for sprints supplied by shared rack
  • Heat from sprints absorbed by thermal packages
  • Fig. www.fortlax.se and Raghavan, Arun, et al. "Computational sprinting on a hardware/software testbed."

SLIDE 5

Power Emergencies

[Figure: circuit breaker trip curve. Axes: Duration of Current Draw (sec; 0.1, 2, 120, 3600) and Current Normalized to Rated Current (1, 2, 3, 5, 10, 20). Tripping regimes: Long-delay, Conventional Tripping, Short Circuit. Regions: Not Tripped (Ptrip = 0), Non-deterministic (Tolerance Band), Tripped (Ptrip = 1).]

  • Sprints may trip breaker
  • Current ↑ with sprinters
  • Time ↑ with sprint duration
  • Risk ↑ with current, time
  • Fig. Fu, Wang, and Lefurgy. "How much power oversubscription is safe and allowed in data centers."
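The figure on this slide separates a not-tripped region (Ptrip = 0), a tripped region (Ptrip = 1), and a non-deterministic tolerance band between them. A minimal sketch of one way to express that as a function of the number of sprinters, assuming the probability rises linearly across the band. The linear shape and the names `n_min` and `n_max` are illustrative assumptions, not taken from the slides:

```python
def trip_probability(n_sprinters: float, n_min: float, n_max: float) -> float:
    """Assumed piecewise-linear breaker trip probability.

    n_min: hypothetical sprinter count below which the breaker never trips.
    n_max: hypothetical sprinter count above which the breaker always trips.
    """
    if n_min >= n_max:
        raise ValueError("n_min must be smaller than n_max")
    if n_sprinters <= n_min:
        return 0.0  # below the tolerance band: not tripped
    if n_sprinters >= n_max:
        return 1.0  # above the tolerance band: tripped
    # inside the tolerance band: assumed linear rise from 0 to 1
    return (n_sprinters - n_min) / (n_max - n_min)
```

For example, with a band from 250 to 750 sprinters, `trip_probability(500, 250, 750)` falls halfway through the band.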

SLIDE 6

Uninterruptible Power Supplies

  • When sprints trip breaker, draw on batteries
  • When sprints complete, recharge batteries

  • Fig. www.amper-ecuador.com

SLIDE 7

Example – Private Clouds

  • Applications compute on servers that share power
  • Processors sprint independently
  • Processors sprint selfishly for performance
  • Fig. Google, www.lasknet.net

SLIDE 8

Sprinting Management

When should processors sprint?

  • Phases with higher performance from sprints
  • But sprints prohibited as chip cools

Which processors should sprint?

  • Processors that benefit most from sprints
  • But sprints prohibited as batteries recover

SLIDE 9

Management Desiderata

Individual Performance

  • Sprints account for phase behavior
  • Sprints now constrain future sprints

System Stability

  • Sprints account for others’ sprinting strategies
  • Sprints risk power emergencies

SLIDE 10

Sprinting Strategy

  • Optimize sprints given constraints
  • Sprint, wait ∆cooling for chip cooling
  • Sprint, wait ∆recovery for rack recovery if breaker trips
[Figure: Utility from Sprint (5 to 8) plotted against Epoch (5 to 30), annotated with "?".]

SLIDE 11

Sprinting Strategy

  • Optimize sprints given constraints
  • Sprint, wait ∆cooling for chip cooling
  • Sprint, wait ∆recovery for rack recovery if breaker trips
[Figure: Utility from Sprint (5 to 8) plotted against Epoch (5 to 30), with × marks on the plot.]

SLIDE 12

Game Theory

Study strategic agents

  • Agents selfishly maximize individual utility

Optimize responses

  • Response maximizes utility, given others’ strategies

Find equilibrium

  • State where all agents play their best responses

SLIDE 13

Sprinting Game

States

  • Active – can sprint
  • Cooling – cannot sprint, chip cooling
  • Recovery – cannot sprint, batteries recharging

Actions

  • Sprint or not, when active

Strategies

  • Agent’s state, app’s phase, history, ...
  • Others’ strategies, utilities, and states, ...
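The three states and the single action can be sketched as a small state machine. This is an illustration under stated assumptions, not the authors' implementation: waits are counted in whole epochs, a breaker trip sends the agent straight to recovery, and the names `Agent`, `step`, `cooling_epochs`, and `recovery_epochs` are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    ACTIVE = "active"      # can sprint
    COOLING = "cooling"    # cannot sprint, chip cooling
    RECOVERY = "recovery"  # cannot sprint, batteries recharging


@dataclass
class Agent:
    state: State = State.ACTIVE
    wait: int = 0  # epochs left before returning to ACTIVE


def step(agent: Agent, wants_sprint: bool, breaker_tripped: bool,
         cooling_epochs: int = 2, recovery_epochs: int = 5) -> bool:
    """Advance one epoch and return True if the agent sprinted.

    breaker_tripped: whether the rack breaker tripped before this epoch
    (assumed to force the agent into recovery regardless of its state).
    """
    if breaker_tripped:
        agent.state, agent.wait = State.RECOVERY, recovery_epochs
        return False
    if agent.state is not State.ACTIVE:
        # cooling or recovering: count down, then become active again
        agent.wait -= 1
        if agent.wait <= 0:
            agent.state, agent.wait = State.ACTIVE, 0
        return False
    if wants_sprint:
        # the only action: sprint when active, then cool down
        agent.state, agent.wait = State.COOLING, cooling_epochs
        return True
    return False
```

Fixed wait lengths keep the sketch short; a probabilistic exit from cooling or recovery would fit the same interface.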

SLIDE 14

Mean Field Equilibrium (MFE)

Challenges

  • Large system with many agents
  • Complex strategies and many competitors
  • Intractable optimization for best response

Solution

  • Abstract many agents with statistical distributions
  • Optimize agents’ strategies against expectations

SLIDE 15

Equilibrium Strategy

Agents maximize expected value of (not) sprinting

  • Current state
  • Utility from sprinting, u
  • Probability of tripping, Ptrip

Agents employ threshold strategy

  • If active and u ≥ uT, then sprint

SLIDE 16

Find Equilibrium – Offline

  • Initialize probability of breaker trip Ptrip
  • Given Ptrip, optimize threshold strategy uT
  • Given uT, estimate number of sprinters N
  • Given N, update probability P′trip
  • Iterate if P′trip ≠ Ptrip
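The loop on this slide can be sketched as a damped fixed-point iteration. Everything inside each step here is an assumption made for illustration: the value-iteration model in `optimize_threshold`, the geometric cooling and recovery probabilities `p_cool` and `p_recover`, the discount `delta`, the stationary estimate in `estimate_sprinters` (which ignores recovery), and the linear trip model with `n_min` and `n_max`. None of these details appear on the slides.

```python
from statistics import mean


def trip_probability(n, n_min, n_max):
    """Assumed linear trip model between n_min and n_max sprinters."""
    if n <= n_min:
        return 0.0
    if n >= n_max:
        return 1.0
    return (n - n_min) / (n_max - n_min)


def optimize_threshold(p_trip, utilities, delta=0.9, p_cool=0.5,
                       p_recover=0.9, sweeps=200):
    """Value iteration over active/cooling/recovery values (assumed model)."""
    v_a = v_c = v_r = 0.0
    for _ in range(sweeps):
        stay = delta * ((1 - p_trip) * v_a + p_trip * v_r)  # do not sprint
        go = delta * ((1 - p_trip) * v_c + p_trip * v_r)    # after a sprint
        new_a = mean(max(u + go, stay) for u in utilities)
        new_c = delta * ((1 - p_trip) * (p_cool * v_c + (1 - p_cool) * v_a)
                         + p_trip * v_r)
        new_r = delta * (p_recover * v_r + (1 - p_recover) * v_a)
        v_a, v_c, v_r = new_a, new_c, new_r
    # sprinting beats waiting exactly when u >= this value
    return delta * (1 - p_trip) * (v_a - v_c)


def estimate_sprinters(u_t, utilities, n_agents, p_cool=0.5):
    """Expected sprinters per epoch; assumes p_cool < 1, ignores recovery."""
    p_sprint = sum(u >= u_t for u in utilities) / len(utilities)
    active = (1 - p_cool) / (1 - p_cool + p_sprint)  # stationary active share
    return n_agents * active * p_sprint


def find_equilibrium(utilities, n_agents, n_min, n_max, p_trip=0.5,
                     damping=0.5, tol=1e-4, max_iters=200):
    """Iterate Ptrip -> uT -> N -> P'trip until P'trip is close to Ptrip."""
    for _ in range(max_iters):
        u_t = optimize_threshold(p_trip, utilities)
        n = estimate_sprinters(u_t, utilities, n_agents)
        new_p = trip_probability(n, n_min, n_max)
        if abs(new_p - p_trip) < tol:
            break
        p_trip = (1 - damping) * p_trip + damping * new_p  # damped update
    return p_trip, u_t, n
```

Damping is a design choice added here to reduce oscillation when the empirical utility samples make the update discontinuous; the loop still stops after `max_iters` if no fixed point is reached.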

SLIDE 17

Execute Strategy – Online

If active and u ≥ uT, then sprint
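The online rule is a single comparison per epoch. A minimal sketch, assuming the agent already has a predicted sprint utility for the epoch and a threshold computed offline; the function and parameter names are hypothetical:

```python
def should_sprint(is_active: bool, predicted_utility: float,
                  threshold: float) -> bool:
    """Threshold strategy: sprint only if active and u >= uT."""
    return is_active and predicted_utility >= threshold


# Usage: apply the rule to a short trace of predicted utilities,
# assuming the agent is active in every epoch shown.
decisions = [should_sprint(True, u, 5.0) for u in [4.0, 6.5, 5.0]]
```

Because the threshold is computed ahead of time, the per-epoch cost is one comparison, which matches the "inexpensive" claim made in the conclusion.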

SLIDE 18

Sprinting Thresholds

[Figure: density of Utility from Sprint for two benchmarks. Linear Regression: utility 2 to 6, density 0.0 to 0.4. PageRank: utility 5 to 15, density 0.00 to 0.20.]

  • Thresholds are optimal and diverse
  • Agents behave strategically to maximize performance

SLIDE 19

Management Architecture

[Figure: per-user components (User, Executor, Engine, Task, Agent, Predictor), repeated for each user, linked to a Coordinator (Alg 1) by Profile and Strategy.]

  • Offline: coordinator profiles utility, optimizes thresholds
  • Online: predictors estimate sprint utility
  • Online: agents apply threshold strategy
  • Online: executor adapts computation

SLIDE 20

Experimental Methodology

Sprinting

  • 3 cores @ 1.2 GHz → 12 cores @ 2.7 GHz

Workloads

  • Apache Spark
  • Spark engine dynamically schedules tasks on active cores

Performance Metric

  • Tasks completed per second (TPS)

Simulation Method

  • R-based simulator using traces of Spark computation

SLIDE 21

Management Policies

Greedy

  • Sprint if neither cooling nor recovering

Exponential Back-off

  • Sprint if neither cooling nor recovering
  • Wait randomly for U[0, 2^k] epochs after the kth trip

Cooperative Threshold

  • Enforce globally optimized threshold

Equilibrium Threshold

  • Announce decentralized, strategic threshold
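A minimal sketch of the wait drawn by the exponential back-off policy, assuming the upper bound doubles with each trip (2**k after the kth trip) and that waits are whole epochs. The function name and the use of a seeded `random.Random` are illustrative choices:

```python
import random


def backoff_wait(k: int, rng: random.Random) -> int:
    """Epochs to wait after the k-th trip, uniform on 0..2**k inclusive."""
    if k < 0:
        raise ValueError("k must be non-negative")
    # randint includes both endpoints
    return rng.randint(0, 2 ** k)


# Usage: waits after the first three trips, with a fixed seed for repeatability.
rng = random.Random(0)
waits = [backoff_wait(k, rng) for k in (1, 2, 3)]
```

The wait depends only on the trip count, not on sprint utility, which is consistent with the later observation that this policy sprints at untimely moments.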

SLIDE 22

Case for Equilibria

              Equilibrium   Cooperative
Performance                 +
Stability     +

  • Cooperative (+): maximize global performance
  • Equilibrium (+): remove incentives to deviate

SLIDE 23

Case for Equilibria

              Equilibrium   Cooperative
Performance                 +
Stability     +             –
  • Cooperative (+): maximize global performance
  • Equilibrium (+): remove incentives to deviate
  • Cooperative (–): enforce strategies globally

SLIDE 24

Case for Equilibria

              Equilibrium   Cooperative
Performance   +             +
Stability     +             –

  • Cooperative (+): maximize global performance
  • Equilibrium (+): remove incentives to deviate
  • Cooperative (–): enforce strategies globally
  • Equilibrium (+): maximize individual performance

SLIDE 25

Sprinting Behavior

[Figure: Number of Sprinting Users (ticks at 300 and 600) against Epoch Index (200 to 1000), one panel each for Greedy, Exponential Backoff, Cooperative Threshold, and Equilibrium Threshold.]

SLIDE 26

Sprinting Performance

[Figure: Performance (Normalized to Greedy), 1 to 6, for the benchmarks naive, decision, gradient, svm, linear, kmeans, als, correlation, pagerank, cc, triangle. Legend: Greedy, Exponential Backoff, Equilibrium Threshold, Cooperative Threshold.]

  • Greedy – aggressive, incurs emergencies
  • Exponential – conservative, untimely sprints
  • Equilibrium – strategic, produces equilibrium
  • Cooperative – optimal, requires enforcement

SLIDE 27

Game States

[Figure: share of time (0% to 100%) that each policy (Greedy, Exponential, Equilibrium, Cooperative) spends in each state: Active (not sprinting), Local cooling, Global recovery, Sprinting.]

  • Greedy – time in recovery
  • Exponential – untimely sprints
  • Equilibrium – timely sprints
  • Cooperative – timely sprints

SLIDE 28

Conclusion

Management with game theory

  • Agents sprint according to threshold – inexpensive
  • Agents have no incentives to deviate – stable
  • Agents optimize response – high performance

Future directions

  • Use game theory to manage scarce resources
  • E.g., big/small processors, accelerators

SLIDE 29

Thank you

Questions?
