How to Make Decisions (Optimally) - Siddhartha Sen, Microsoft Research NYC (PowerPoint presentation transcript)


SLIDE 1

How to Make Decisions (Optimally)

Siddhartha Sen Microsoft Research NYC

SLIDE 2

AI for Systems

  • Vision: Infuse AI to optimize cloud infrastructure decisions, while being:
    • Minimally disruptive (agenda: Harvesting Randomness)
    • Synergistic with human solutions (agenda: HAIbrid algorithms)
    • Safe and reliable (agenda: Safeguards)
  • Impact: the criteria above differentiate us and ensure wider-spread impact
  • Team:
    • MSR NYC, MSR India
    • Azure: Azure Compute, Azure Frontdoor
    • Universities: Columbia, NYU, Princeton, Yale, UBC, U. Chicago, Cornell
SLIDE 3

Vision: Safe optimization without disruption

[Diagram: a Fuzzer, an Optimizer, and a Safeguard attached to a complex System]

Evaluate alternatives without disrupting?

SLIDE 4

Roadmap

  • A framework for making systematic decisions: Reinforcement Learning
  • A way to reason about decisions in the past: Counterfactual Evaluation
  • How to make this work in cloud systems? Successes, fundamental obstacles, workarounds
SLIDE 5

Decisions in the real world

[Diagram: context → policy → action → reward]

Which policy maximizes my total reward?

SLIDE 6

Reinforcement learning (RL)

[Diagram: the RL loop - context → policy → action → reward]

Which policy maximizes my total reward?

SLIDE 7

Example: online news articles (MSN)

Context: user, browse history | Action: article on top | Reward: clicked/ignored

SLIDE 8

Example: machine health (Azure cloud)

Context: machine, failure history | Action: wait time before reboot | Reward: total downtime

SLIDE 9

Example: commute options

Context: weather, traffic | Action: bike, subway, car | Reward: trip time, cost

SLIDE 10

Example: online dating

Context: user, dating history | Action: match | Reward: length of relationship

SLIDE 11

Reinforcement learning reflects real life

  • Traditional (supervised) machine learning needs the answer as input
  • The label (dog, cat, …) gives you the full answer; train a model: context → label

SLIDE 12

Reinforcement learning reflects real life

  • RL interacts with the environment and learns from feedback
  • The reward only gives a partial answer; train a policy: context → action

SLIDE 13

How to learn in an RL setting?

  • Explore to learn about new actions
  • Incorporate reward feedback
  • Do this systematically! (Humans are not good at this)
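The three steps above can be sketched as an epsilon-greedy exploration loop. This is a minimal illustration, not anything from the talk: the policy, action set, and context here are hypothetical. Logging the probability of each emitted action is what later makes counterfactual evaluation possible.

```python
import random

def epsilon_greedy(policy, actions, context, epsilon=0.1):
    """Mostly exploit the current policy, but explore uniformly with
    probability epsilon so every action keeps receiving feedback."""
    greedy = policy(context)
    if random.random() < epsilon:
        action = random.choice(actions)  # explore a (possibly) new action
    else:
        action = greedy                  # exploit what we believe is best
    # Probability this mixture assigns to the emitted action; log it so the
    # decision can be over/underweighted in later offline evaluation.
    prob = epsilon / len(actions) + (1 - epsilon if action == greedy else 0.0)
    return action, prob

# Hypothetical usage: three candidate news articles, a policy that picks 0.
actions = [0, 1, 2]
policy = lambda ctx: 0
action, prob = epsilon_greedy(policy, actions, {"user": "u1"}, epsilon=0.1)
assert action in actions and 0 < prob <= 1
```

Doing the exploration in code, systematically, is exactly what the slide says humans are bad at.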
SLIDE 14

Simple example: online news articles

[Diagram: Policy A (Career) vs. Policy B (Location), with Clicked/Ignored outcomes]

This is an A/B test! Humans are bad at this.

SLIDE 15

RL: richer policy space, richer representation

Simple example: online news articles

[Diagram: Policy A (Career) vs. Policy B (Location), with Clicked/Ignored outcomes]

Giant table = policy space

SLIDE 16

Aside: Deep Reinforcement Learning!

  • Superhuman ability in Go, Chess
  • Lots of engineering/tweaking
  • Learning from self-play is not new
  • Far from an AI apocalypse
  • But (opinion): a glimpse of a more subtle, subconscious overtaking

SLIDE 17

[Diagram: Policy A (e.g. Career) vs. Policy B (e.g. Location), with Clicked/Ignored outcomes; giant table = policy space]

SLIDE 18

Testing policies online is inefficient

  • Costly (production deployment)
  • Risky (live user traffic)
  • Slow (must split 100% of traffic)

[Diagram: Policy A (e.g. Career) vs. Policy B (e.g. Location), with Clicked/Ignored outcomes; giant table = policy space]

SLIDE 19

Testing policies online is inefficient

  • Problem: randomizing over policies
  • Instead: randomize directly over actions; collect data first, then evaluate policies after-the-fact

[Diagram: Policy A (e.g. Career) vs. Policy B (e.g. Location), with Clicked/Ignored outcomes; giant table = policy space]

SLIDE 20

Test policies offline! Later evaluate the Career, Location, or Gender policy on the same log:

  User  Occupation  Location  Gender  Outcome
  1     Engineer    Texas     Male    Clicked
  2     Engineer    Seattle   Female  Clicked
  3     Engineer    Seattle   Male    Ignored
  4     Teacher     Texas     Female  Clicked

SLIDE 21

Counterfactual evaluation (testing policies offline)

  • Ask “what if” questions about the past: how would this new policy have performed if I had run it?
  • Basic idea: use the (randomized) decisions made by a deployed policy to match/evaluate the decisions the new policy would make
  • Problem: the deployed policy’s decisions may be biased
SLIDE 22
Counterfactual evaluation (testing policies offline)

  • Ask “what if” questions about the past: how would this new policy have performed if I had run it?
  • Basic idea: use the (randomized) decisions made by a deployed policy to match/evaluate the decisions the new policy would make
  • Use probabilities to over/underweight decisions
  • Test many different policies on the same dataset, offline!
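One standard way to do this over/underweighting is inverse propensity scoring (IPS). The sketch below is a minimal illustration under assumptions not stated in the talk: a hypothetical log of (context, action, reward, logged probability) tuples and a deterministic new policy.

```python
def ips_estimate(log, new_policy):
    """Inverse propensity scoring: keep only logged decisions the new policy
    agrees with, and reweight each reward by 1 / (probability the deployed
    policy chose that action), so rarely logged actions are overweighted
    and common ones underweighted."""
    total = 0.0
    for context, action, reward, logged_prob in log:
        if new_policy(context) == action:
            total += reward / logged_prob
        # else: contributes 0 -- the new policy would have acted differently
    return total / len(log)

# Hypothetical log from a deployed policy that randomized 50/50 over articles.
log = [
    ({"topic": "career"},   "A", 1.0, 0.5),
    ({"topic": "location"}, "B", 0.0, 0.5),
    ({"topic": "career"},   "A", 1.0, 0.5),
]
always_a = lambda ctx: "A"
print(ips_estimate(log, always_a))  # (1/0.5 + 0 + 1/0.5) / 3 ≈ 1.33
```

The same log can then score many candidate policies offline, which is what makes this so much cheaper than running each policy as a live A/B test.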
SLIDE 23

RL + Counterfactual Evaluation

  • Very powerful combination: evaluate a billion policies offline, find the best one
  • Exponential boost over online A/B testing

Can we apply this paradigm to cloud systems?

SLIDE 24

Example: machine health (Azure Compute)

Context: machine, failure history | Action: wait time before reboot | Reward: total downtime

SLIDE 25

Example: TCP config (Azure Frontdoor)

Context: OS, locale, traffic type | Action: TCP parameter settings | Reward: CPU utilization

SLIDE 26

Example: replica selection (Azure LB)

Context: request, replica loads | Action: replica to handle the request | Reward: latency

SLIDE 27

What if…

  • … we waited a different amount of time before rebooting?
  • … we used different TCP settings on an edge proxy machine?
  • … we sent a request to a different replica?

Counterfactual evaluation!

SLIDE 28

Counterfactual evaluation in Systems

  • Opportunity: many systems are naturally randomized
    • Load balancing, data replicas, cache eviction, fault handling, etc.
    • When we need to spread things, when choices are ambiguous
    → Free exploration!
  • Opportunity: many systems provide implicit feedback
    • Naïve defaults, conservative parameter settings
    • Worse settings yield more information
    → Free feedback!

SLIDE 29

Counterfactual evaluation in Systems

  Challenge                                                 Technique
  Mess of methods/techniques spanning multiple disciplines  Taxonomy
  Huge action spaces (coverage)                             Spatial coarsening
  Stateful, non-independent decisions                       Temporal coarsening, time horizons
  Dynamic environments                                      Baseline normalization

SLIDE 30

Taxonomy for counterfactual evaluation

  • Feedback? Full → Supervised Learning
  • Feedback? Partial → Randomization?
    • No → Direct method, or start to randomize/explore
    • Yes → Independent decisions?
      • Yes → Unbiased estimator (DR): Reinforcement Learning (contextual bandits)
      • No → Unbiased estimator + time horizon (DR-Time): Reinforcement Learning (general)
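The doubly robust (DR) estimator named in the taxonomy combines a learned reward model (the direct method) with an IPS-style correction. This is a minimal sketch under assumptions not in the talk: a hypothetical log of (context, action, reward, logged probability) tuples, and a deliberately crude constant reward model used purely for illustration.

```python
def doubly_robust(log, new_policy, reward_model):
    """DR estimate of a new policy's average reward: start from the reward
    model's prediction (direct method), then correct its bias with an
    importance-weighted residual whenever the logged action matches the
    new policy's choice."""
    total = 0.0
    for context, action, reward, logged_prob in log:
        chosen = new_policy(context)
        estimate = reward_model(context, chosen)  # direct-method baseline
        if chosen == action:
            # IPS correction: how wrong was the model on this logged decision?
            estimate += (reward - reward_model(context, action)) / logged_prob
        total += estimate
    return total / len(log)

# Hypothetical log and a constant reward model (always predicts 0.5).
log = [
    ({"topic": "career"},   "A", 1.0, 0.5),
    ({"topic": "location"}, "B", 0.0, 0.5),
    ({"topic": "career"},   "A", 1.0, 0.5),
]
always_a = lambda ctx: "A"
constant_model = lambda ctx, action: 0.5
print(doubly_robust(log, always_a, constant_model))
```

The estimator stays unbiased when the logged probabilities are correct, and its variance shrinks as the reward model improves, which is why it sits at the "Yes, independent decisions" leaf of the taxonomy.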

SLIDE 31

Example: Machine health in Azure Compute

  • Wait for some time, then reboot
SLIDE 32

Example: Machine health in Azure Compute

  • Wait for some time, then reboot
  • Wait for {1,2,…,10 min}

Spatial coarsening

SLIDE 33

Example: Machine health in Azure Compute

  Decision?   Action        [-]Reward
  Machine A   Wait 10 min   5 min
  Machine B   Wait 10 min   3 min
  Machine C   Wait 10 min   10 min + reboot
  …           …             …

SLIDE 34

Example: Machine health in Azure Compute

  Decision?   Action        [-]Reward         Feedback
  Machine A   Wait 10 min   5 min             Wait 1,2,…,9
  Machine B   Wait 10 min   3 min             Wait 1,2,…,9
  Machine C   Wait 10 min   10 min + reboot   Wait 1,2,…,9
  …           …             …                 …

SLIDE 35

Example: Machine health in Azure Compute

  Decision?   Action        [-]Reward         Feedback
  Machine A   Wait 6 min    5 min             Wait 1,2,…,9
  Machine B   Wait 2 min    3 min             Wait 1,2,…,9
  Machine C   Wait 10 min   10 min + reboot   Wait 1,2,…,9
  …           …             …                 …

SLIDE 36

Example: Machine health in Azure Compute

  Decision?   Action        [-]Reward         Feedback
  Machine A   Wait 6 min    5 min             Wait 1,2,…,9
  Machine B   Wait 2 min    2 min + reboot    Wait 1
  Machine C   Wait 10 min   10 min + reboot   Wait 1,2,…,9
  …           …             …                 …

Implicit feedback
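The implicit-feedback trick in the tables above can be written down directly: if a machine recovered on its own at minute t, any threshold of at least t would have seen the same recovery, and any shorter threshold would have hit its deadline and rebooted; if the machine was rebooted at its threshold, only shorter thresholds are knowable. This is a sketch under one loud assumption, a hypothetical fixed reboot cost not given in the talk.

```python
REBOOT_COST = 5  # hypothetical extra downtime (minutes) charged for a reboot

def implicit_feedback(waited, recovered_at):
    """Expand one logged wait-before-reboot decision into counterfactual
    downtimes for every candidate threshold (1..10 minutes, as on the slide).

    waited:       threshold actually used, in minutes
    recovered_at: minutes until self-recovery, or None if the machine hit
                  the threshold and was rebooted
    """
    feedback = {}
    for w in range(1, 11):
        if recovered_at is not None and recovered_at <= w:
            feedback[w] = recovered_at       # would have recovered on its own
        elif recovered_at is not None or w <= waited:
            feedback[w] = w + REBOOT_COST    # would have hit the deadline
        # else: we rebooted earlier than w, so longer waits stay unknown
    return feedback

# Machine A: waited 10 min, recovered at 5 -> all ten thresholds knowable.
# Machine B: waited 2 min and rebooted    -> only thresholds 1 and 2 knowable.
print(implicit_feedback(10, 5))
print(implicit_feedback(2, None))
```

This is why "worse settings yield more information": a long wait that ends in recovery reveals the outcome of every shorter wait.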

SLIDE 37

Results: Machine health in Azure Compute

[Chart: estimated reward, DR vs. DR + implicit feedback]

SLIDE 38

Results: Machine health in Azure Compute

SLIDE 39

Example: TCP config in Azure Frontdoor

[Diagram: clients in Atlanta, USA and Mumbai, India send requests through edge proxy clusters over the WAN to service endpoints in the cloud datacenter]

  • TCP parameters:
    • initial cwnd
    • initial RTO
    • min RTO
    • max SYN retransmit
    • delayed ACK freq
    • delayed ACK timeout
SLIDE 40

Example: TCP config in Azure Frontdoor

  • TCP parameters:
    • initial cwnd
    • initial RTO
    • min RTO
    • max SYN retransmit
    • delayed ACK freq
    • delayed ACK timeout
  • Pick from 17 different configurations, per hour per machine

[Diagram: clients in Atlanta and Mumbai, edge proxy clusters, WAN, cloud datacenter service endpoints]

Spatial/temporal coarsening

SLIDE 41

Example: TCP config in Azure Frontdoor

  • Dynamic workload and environment
  • Assign a “control” machine to each RL machine as a baseline; report the delta

[Diagram: clients in Atlanta and Mumbai, edge proxy clusters, WAN, cloud datacenter service endpoints]

SLIDE 42

Example: TCP config in Azure Frontdoor

  • Dynamic workload and environment
  • Assign a “control” machine to each RL machine as a baseline; report the delta

[Diagram: clients in Atlanta and Mumbai, edge proxy clusters, WAN, cloud datacenter service endpoints]

Baseline normalization
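Baseline normalization itself is only a few lines: pair each RL machine with a control machine serving similar traffic, and report the reward delta so that workload swings common to both cancel out. The hourly reward series below are hypothetical, purely for illustration.

```python
def normalized_rewards(rl_series, control_series):
    """Reward delta between an RL machine and its paired control machine,
    measured over the same intervals; environment shifts that hit both
    machines (traffic spikes, diurnal load) cancel in the difference."""
    return [rl - ctrl for rl, ctrl in zip(rl_series, control_series)]

# Hour 2 brings a load spike that hits both machines; the delta stays stable.
rl      = [0.70, 0.90, 0.72]   # e.g. per-hour CPU-utilization reward
control = [0.65, 0.85, 0.66]
print(normalized_rewards(rl, control))  # deltas stay near 0.05 throughout
```

Reporting deltas rather than raw rewards is what lets the offline estimates remain meaningful in a dynamic environment.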

SLIDE 43

Results: TCP config in Azure Frontdoor

  Estimate       Reward                  Error
  Ground truth   0.713                   --
  DR             0.720 (0.637, 0.796)    0.97%

SLIDE 44

Lesson: Unbiased estimator vs. biased policy

[Chart: reward per TCP configuration 1-17; configuration 2 is the default]

SLIDE 45

Lesson: Unbiased estimator vs. biased policy

[Chart: reward per TCP configuration 1-17; configuration 2 is the default]

SLIDE 46

Lesson: Unbiased estimator vs. biased policy

SLIDE 47

Example: Replica selection

  • Choose replica to process

each request

Natural randomness

SLIDE 48

Example: Replica selection

  • New policy: always send to server 1?
  • Problem: non-independent decisions interact over time!

  Decision?   Action      [-]Reward
  Request A   Replica 1   1 ms
  Request B   Replica 2   2 ms
  Request C   Replica 1   1 ms
  …           …           …

Time horizons

SLIDE 49

Results: Replica selection

  • Run a Zipf workload, collect data
  • Evaluate a new policy: hash(key) % num_replicas

[Chart: latency of DR, ground truth, DR-Time, and Random]
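The evaluated policy and a time-horizon variant of the estimator can be sketched as follows. The horizon/log format is hypothetical (not from the talk), and integer request keys are used because Python's hash() is deterministic for ints but salted for strings. Grouping decisions into short horizons captures their interactions within a horizon while treating the horizons themselves as approximately independent, which is the idea behind DR-Time.

```python
def hash_policy(key, num_replicas):
    """The new policy from the slide: hash(key) % num_replicas."""
    return hash(key) % num_replicas

def dr_time(horizons, new_policy):
    """Time-horizon IPS sketch (in the spirit of DR-Time): a horizon's reward
    counts only if the new policy matches every logged decision inside it,
    reweighted by the product of the logged probabilities."""
    total = 0.0
    for decisions, horizon_reward in horizons:
        match, prob = True, 1.0
        for context, action, logged_prob in decisions:
            match = match and new_policy(context) == action
            prob *= logged_prob
        if match:
            total += horizon_reward / prob
    return total / len(horizons)

# Hypothetical log: each horizon is ([(request_key, replica, prob), ...], reward).
horizons = [
    ([(4, 1, 0.5), (5, 2, 0.5)], 3.0),  # both decisions match hash_policy
    ([(3, 1, 0.5)], 1.0),               # hash_policy would pick replica 0
]
policy = lambda key: hash_policy(key, 3)
print(dr_time(horizons, policy))  # only the first horizon matches: 12.0 / 2 = 6.0
```

Longer horizons capture more interaction but match less often (higher variance), which is the coverage trade-off the slide's DR vs. DR-Time comparison illustrates.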

SLIDE 50

Takeaways

  • RL + counterfactual evaluation is a powerful paradigm; use it to reason about systems!
  • Opportunities: natural randomness, implicit feedback
  • Challenges: huge action spaces (coverage), non-independent decisions, dynamic environments
  • Fundamental problem: experimenting at a small scale/fraction of traffic may not reflect performance at full scale

SLIDE 51

Challenges in real‐life decisions

  • Commute options
    • Only one datapoint per day (coverage)
    • Changing traffic patterns, construction, station closures (dynamic environment)

SLIDE 52

Challenges in real‐life decisions

  • Online dating
    • Few datapoints, and exploration is costly (coverage)
    • Very unromantic
SLIDE 53

Can we (should we) optimize real life?

  • RL + counterfactual evaluation is not enough
  • Combine with behavioral psychology
    • Model behavior, learn from others, avoid behavioral traps
    • “Automate Dan”
  • But: allow individuality, leverage natural exploration

[Photo: Dan Goldstein]