How to Make Decisions (Optimally)
Siddhartha Sen Microsoft Research NYC
AI for Systems vision: infuse AI to optimize cloud infrastructure decisions, while being:
Minimally disruptive (agenda: Harvesting Randomness)
Synergistic with
Fuzzer, Optimizer, Safeguard
(complex)
Can we evaluate alternatives without disrupting the system?
Reinforcement Learning
Counterfactual Evaluation
context → policy → action → reward
Which policy maximizes my total reward?
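The loop above can be sketched as a minimal contextual-bandit interaction; the context feature, actions, and rewards below are illustrative stand-ins, not from the talk:

```python
import random

def policy(context):
    # Illustrative policy: map a context feature to an action.
    return "short_wait" if context["failures"] < 3 else "long_wait"

def environment(context, action):
    # Illustrative reward: negative downtime, so higher is better.
    return -5.0 if action == "short_wait" else -10.0

total_reward = 0.0
for _ in range(100):
    context = {"failures": random.randint(0, 5)}  # observe context
    action = policy(context)                      # policy picks an action
    reward = environment(context, action)         # environment returns reward
    total_reward += reward
```

The policy only ever sees the reward of the action it took, which is what makes evaluating a *different* policy nontrivial.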
Context | Action | Reward
user, browse history | article on top | clicked/ignored
machine, failure hist | wait time before reboot | total downtime
weather, traffic | bike, subway, car | trip time, cost
user, dating hist | match | length of relationship
Supervised learning: feedback gives you the full answer (the true label: dog, cat, …), so you train a model.
Reinforcement learning (context, action, reward): feedback only gives a partial answer (the reward of the chosen action), so you train a policy.
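The contrast in feedback can be made concrete; the record fields below are hypothetical:

```python
# Supervised feedback reveals the full answer: the true label.
supervised_example = {"features": [0.2, 0.7], "label": "cat"}

# Bandit feedback reveals only the reward of the action actually taken;
# the rewards of the actions not taken remain unknown.
bandit_example = {
    "context": "engineer",
    "action": "show_career_article",  # the one action that was chosen
    "reward": 1.0,                    # 1.0 = clicked, 0.0 = ignored
}
```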
[Figure: Policy A (Career) vs. Policy B (Location); outcomes Clicked/Ignored]
This is an A/B test!
Humans are bad at this
RL: richer policy space, richer representation
[Figure: the policy space as a giant table; e.g. Policy A (Career) vs. Policy B (Location), outcomes Clicked/Ignored]
subtle, subconscious overtaking
Problem: randomizing over policies.
Instead: randomize directly over actions; collect data first, then evaluate policies after the fact.
Occupation | Location | Gender | Outcome
Engineer | Texas | Male | Clicked
Engineer | Seattle | Female | Clicked
Engineer | Seattle | Male | Ignored
Teacher | Texas | Female | Clicked
How would a new policy have performed if I had run it?
Match/evaluate the decisions the new policy would make.
Use probabilities to weight the matched decisions, then pick the best one.
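A standard way to use those probabilities is inverse propensity scoring (IPS): weight each logged reward by the inverse of the probability with which the logging policy chose that action, counting only records where the new policy agrees. A minimal sketch; the log format and toy data are assumptions for illustration:

```python
def ips_estimate(log, new_policy):
    """Off-policy value estimate of new_policy from logged
    (context, action, reward, prob) records, where prob is the
    probability the logging policy chose that action."""
    total = 0.0
    for rec in log:
        if new_policy(rec["context"]) == rec["action"]:
            total += rec["reward"] / rec["prob"]  # importance weight
    return total / len(log)

# Toy log: actions were chosen uniformly over two options (prob = 0.5).
log = [
    {"context": "engineer", "action": "career",   "reward": 1.0, "prob": 0.5},
    {"context": "engineer", "action": "location", "reward": 0.0, "prob": 0.5},
    {"context": "teacher",  "action": "career",   "reward": 0.0, "prob": 0.5},
    {"context": "teacher",  "action": "location", "reward": 1.0, "prob": 0.5},
]
always_career = lambda ctx: "career"
value = ips_estimate(log, always_career)  # (1.0/0.5 + 0.0/0.5) / 4 = 0.5
```

Because the randomization probabilities are logged, this estimate is unbiased even though the new policy was never actually run.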
Can we apply this paradigm to cloud systems?
Context | Action | Reward
machine, failure hist | wait time before reboot | total downtime
OS, locale, traffic type | TCP parameter settings | CPU utilization
req, replica loads | replica to handle request | latency
Counterfactual evaluation!
Free exploration!
Free feedback!
Challenge | Technique
Mess of methods/techniques spanning multiple disciplines | Taxonomy
Huge action spaces (coverage) | Spatial coarsening
Stateful, non-independent decisions | Temporal coarsening, time horizons
Dynamic environments | (Baseline normalization)
Feedback? Randomization? Independent decisions?
Full feedback → Supervised Learning (direct method)
Partial feedback, no randomization → direct method (?); randomize/explore going forward
Partial feedback, randomization, independent decisions → Reinforcement Learning (contextual bandits): unbiased estimator (DR)
Partial feedback, randomization, non-independent decisions → Reinforcement Learning (general): unbiased estimator + time horizon (DR-Time)
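The unbiased estimator labeled DR is the doubly robust estimator, which combines a reward model (the direct method) with an IPS-style correction on records where the logged action matches. A sketch under an assumed log format; the toy data and flat reward model are illustrative, not from the talk:

```python
def dr_estimate(log, new_policy, reward_model):
    """Doubly robust off-policy estimate: the model's prediction for the
    new policy's action, plus an importance-weighted correction of the
    model's error on records where the logged action matches."""
    total = 0.0
    for rec in log:
        a_new = new_policy(rec["context"])
        total += reward_model(rec["context"], a_new)  # direct method part
        if a_new == rec["action"]:
            residual = rec["reward"] - reward_model(rec["context"], rec["action"])
            total += residual / rec["prob"]           # IPS-style correction
    return total / len(log)

# Toy log: actions were chosen uniformly over two options (prob = 0.5).
log = [
    {"context": "engineer", "action": "career",   "reward": 1.0, "prob": 0.5},
    {"context": "engineer", "action": "location", "reward": 0.0, "prob": 0.5},
    {"context": "teacher",  "action": "career",   "reward": 0.0, "prob": 0.5},
    {"context": "teacher",  "action": "location", "reward": 1.0, "prob": 0.5},
]
always_career = lambda ctx: "career"
flat_model = lambda ctx, a: 0.5  # deliberately crude reward model
value = dr_estimate(log, always_career, flat_model)  # 0.5 on this toy log
```

The estimate stays unbiased if either the reward model or the logged probabilities are correct, which is where the name "doubly robust" comes from.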
Spatial coarsening
Logged decisions:
Decision? | Action | [-]Reward
Machine A | Wait 10 min | 5 min
Machine B | Wait 10 min | 3 min
Machine C | Wait 10 min | 10 min + reboot

Each observed wait also reveals the outcome of shorter waits:
Decision? | Action | [-]Reward | Feedback
Machine A | Wait 10 min | 5 min | Wait 1,2,…,9
Machine B | Wait 10 min | 3 min | Wait 1,2,…,9
Machine C | Wait 10 min | 10 min + reboot | Wait 1,2,…,9

Evaluating a new policy's shorter waits counterfactually:
Decision? | Action | [-]Reward | Feedback
Machine A | Wait 6 min | 5 min | Wait 1,2,…,9
Machine B | Wait 2 min | 2 min + reboot | Wait 1
Machine C | Wait 10 min | 10 min + reboot | Wait 1,2,…,9
Implicit feedback
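A sketch of how one logged wait can be expanded into implicit feedback for every shorter wait; the downtime model and fixed reboot cost are assumptions for illustration, not the talk's actual system:

```python
def implicit_records(machine, wait, recovery_time):
    """From one logged decision (wait `wait` minutes before rebooting),
    derive the downtime every shorter wait would have incurred.
    Assumed model: if the machine recovered on its own before the wait
    expired, downtime equals the recovery time; otherwise we reboot
    and pay the wait plus a fixed reboot cost."""
    REBOOT_COST = 5  # assumed reboot cost, in minutes
    records = []
    for w in range(1, wait + 1):
        if recovery_time is not None and recovery_time <= w:
            downtime = recovery_time      # machine came back on its own
        else:
            downtime = w + REBOOT_COST    # gave up at w and rebooted
        records.append({"machine": machine, "wait": w, "downtime": downtime})
    return records

# Machine A waited 10 min and the machine recovered after 5 min:
recs = implicit_records("A", wait=10, recovery_time=5)
```

One logged action thus yields feedback for many counterfactual actions, which is why implicit feedback tightens the DR estimate.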
[Plot: estimates from DR vs. DR + implicit feedback]
[Diagram: clients in Atlanta, USA and Mumbai, India exchange requests/responses with edge proxy clusters, which connect over the WAN to cloud datacenter endpoints for Service 1 and Service 2]
Configurations randomized per hour, per machine
Spatial/temporal coarsening
Dynamic environment: pair a control machine to each RL machine as baseline, report the delta
Baseline normalization
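Baseline normalization can be sketched as pairing each RL machine with a baseline machine in the same environment and reporting per-pair deltas, so shared environmental swings cancel; the numbers are illustrative:

```python
def normalized_rewards(rl_rewards, baseline_rewards):
    """Per-pair deltas between an RL machine and its paired baseline
    machine running the default policy in the same environment.
    Shared dynamics (traffic spikes, time of day) affect both machines
    and cancel in the delta."""
    return [rl - base for rl, base in zip(rl_rewards, baseline_rewards)]

# A traffic spike in period 2 hurts both machines; the delta stays stable.
rl   = [10.0, 7.0, 11.0]
base = [ 8.0, 5.0,  9.0]
deltas = normalized_rewards(rl, base)  # [2.0, 2.0, 2.0]
```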
Estimate | Reward | Error
Ground truth | 0.713 | --
DR | 0.720 (0.637, 0.796) | 0.97%
[Plot: reward per configuration; configurations 1-17, 2 = default]
A decision for each request.
Natural randomness: should this request go to server 1?
But these decisions interact over time!
Decision? | Action | [-]Reward
Request A | Replica 1 | 1 ms
Request B | Replica 2 | 2 ms
Request C | Replica 1 | 1 ms
Time horizons
hash(key) % num_replicas
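`hash(key) % num_replicas` is deterministic per key but looks uniform across many keys, which is the "natural randomness" counterfactual evaluation can harvest. A sketch using a stable hash (Python's built-in `hash` is randomized per process, so `hashlib` is used instead):

```python
import hashlib

def pick_replica(key: str, num_replicas: int) -> int:
    """Deterministic replica choice via hashing; across many keys the
    assignment is approximately uniform over the replicas."""
    digest = hashlib.sha256(key.encode()).hexdigest()
    return int(digest, 16) % num_replicas

# Requests spread roughly evenly over 3 replicas.
counts = [0, 0, 0]
for i in range(3000):
    counts[pick_replica(f"request-{i}", 3)] += 1
```

The catch, as above: because replica loads carry over from one request to the next, these decisions are not independent, which is what motivates time horizons (DR-Time).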
[Plot: latency; DR and DR-Time estimates vs. ground truth and Random]
Counterfactual evaluation lets us reason about systems!
Stateful, non-independent decisions; dynamic environments
Results may not reflect performance at full scale
(coverage)
construction, station closures (dynamic environment)
Coverage alone is not enough
psychology
natural exploration
Dan Goldstein